Chapter 1
Introduction 



This document presents the research I have undertaken since the beginning of my PhD thesis. The Laboratoire de Probabilités et Modèles Aléatoires of Université Paris 6 hosted my PhD (2001-2004). I was then recruited in the Centre d'Enseignement et de Recherche en Traitement de l'Information et Signal de l'École Nationale des Ponts et Chaussées, which is now a common research laboratory with the Centre Scientifique et Technique du Bâtiment and part of the Laboratoire d'Informatique Gaspard Monge de l'Université Paris Est. Besides, since 2007, part of my research has been done within the Willow team of the Laboratoire d'Informatique de l'École Normale Supérieure.

My main research directions are statistical learning theory and machine learning techniques for computer vision. Machine learning is a research field positioned between statistics, computer science and applied mathematics. Its goal is to bring out theories and algorithms to better understand and deal with complex systems for which no simple, accurate and easy-to-use model exists. It has a considerable impact on a wide variety of scientific domains, including text analysis and indexing, financial market analysis, search engines, bioinformatics, speech recognition, robotics and industrial engineering. The development of new sensors to acquire data and the increasing storage capacity and computational power of computers have brought new perspectives to understand more and more complex systems from observations. In particular, machine learning techniques are used in computer vision tasks that are unsolvable using classical methods (object detection, handwriting recognition, image segmentation and annotation).

The core problem in statistical learning can be formalized in the following way. We observe $n$ input-output (or object-label) pairs: $Z_1 = (X_1, Y_1), \dots, Z_n = (X_n, Y_n)$. A new input $X$ comes. The goal is to predict its associated output $Y$. The input is usually high dimensional and highly structured (such as a digital image). The output is simple: it is typically a real number or an element in a finite set (for instance, 'yes' or 'no' in the case of the detection of a specific object in the digital image). The usual probabilistic modelling is that the observed data (or training set) and the input-output pair $Z = (X, Y)$ are independent and identically distributed random variables coming from some unknown distribution $P$, and that, for various possible reasons, the output is not necessarily a deterministic function of the input.

The lack of quality of a prediction $y'$ when $y$ is the true output is measured by its loss, denoted $\ell(y, y')$. Typical loss functions are the 0/1 loss: $\ell(y, y') = \mathbf{1}_{y \neq y'}$ (the loss is one if and only if the prediction differs from the true output) and the square loss: $\ell(y, y') = (y - y')^2$. The latter loss is more appropriate than the 0/1 loss when the output space is the real line, since a small difference between the prediction and the true output then generates a small loss. The target of learning is to infer from the training set a function $g$ from the input space to the output space having a low risk, also called expected loss or generalization error:

\[ R(g) = \mathbb{E}_{(X,Y) \sim P}\, \ell(Y, g(X)). \]

Statistical learning theory aims at answering the following questions. What are 
the conditions for (asymptotic) consistency of the learning scheme? What can we 
learn from a finite sample of observations? Under which circumstances can we
expect the risk to be close to the risk of the best prediction function, that is the one 
we could have proposed had we a full knowledge of the probability distribution P 
underlying the observations? How accurate is the prediction built on the training 
set? For instance, how low is its risk? What kind of guarantees can we ensure? 
Both theoretical and empirical (i.e., computable from the observed data) upper 
bounds on the risk or the excess risk are of interest. Can we understand/explain 
the success of some prediction schemes? Besides, we also expect that a new 
theoretical analysis leads to the design of new prediction methods. 

This document details my contributions to these issues, and specifically to: 

• the PAC-Bayesian analysis of statistical learning, 

• the three aggregation problems: given d functions, how to predict as well as 

- the best of these d functions (model selection type aggregation), 

- the best convex combination of these d functions, 

- the best linear combination of these d functions, 

• the multi-armed bandit problems. 

Being in computer science departments where image processing and computer vision are core research directions has led me to address a wide variety of topics in which machine learning plays a key role. These include object recognition, content-based image retrieval, image segmentation, image annotation and vanishing point detection. This document will not detail my contributions on these topics. My related publications can be found on my webpage.
Chapter 2

The PAC-Bayesian analysis of statistical learning 



2.1. Introduction 

The natural target of learning is to predict as well as if we had known the distribution generating the input-output pairs. In other words, we want to infer from the training set $Z_1^n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ a prediction function $\hat g$ whose risk is close to the risk of the Bayes predictor $g^* = \operatorname{argmin}_g R(g)$, where the minimum is taken among all functions $g : \mathcal{X} \to \mathcal{Y}$ (such that $\ell(Y, g(X))$ is integrable). The goal is therefore to propose a good estimator $\hat g$ of $g^*$, where the quality of the estimator is measured not by the functional proximity of the prediction functions but by the closeness of their risks.

Since the distribution $P$ of the input-output pair is unknown, the risk is not observed, and numerous core learning procedures have recourse to its empirical counterpart:

\[ r(g) = \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, g(X_i)), \]

either by minimizing it on a restricted class of functions, or almost equivalently by minimizing a linear combination of this empirical risk and a penalty (or regularization) term whose role is to favor "simple" functions. The term "simple" typically refers to some a priori of the statistician, and is often linked to either some smoothness property or some sparsity of the prediction function. The traditional approach to statistical learning theory relies on the study of $R(\hat g) - r(\hat g)$.
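As a minimal illustration of the empirical risk $r(g)$ for the two losses mentioned in the introduction, here is a Python sketch using NumPy. The function names and the toy data are assumptions made for this example only.

```python
import numpy as np

def zero_one_loss(y, y_pred):
    """0/1 loss: 1 when the prediction differs from the true output."""
    return (np.asarray(y) != np.asarray(y_pred)).astype(float)

def square_loss(y, y_pred):
    """Square loss: (y - y')^2."""
    return (np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2

def empirical_risk(g, X, Y, loss):
    """r(g) = (1/n) * sum_i loss(Y_i, g(X_i))."""
    return float(np.mean(loss(Y, g(X))))

# Hypothetical toy data: a threshold classifier on the real line.
X = np.array([-2.0, -1.0, 0.5, 1.5])
Y = np.array([0, 1, 1, 1])
g = lambda x: (x > 0).astype(int)
r = empirical_risk(g, X, Y, zero_one_loss)  # one mistake out of four points
```

The same helper works for regression by passing `square_loss` instead of `zero_one_loss`.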

In the PAC-Bayesian approach, randomized prediction schemes are considered. Let $\mathcal{M}$ denote the set of distributions on the set $\mathcal{G}(\mathcal{X};\mathcal{Y})$ of functions from the input space to the output space. A distribution $\rho$ in $\mathcal{M}$ is chosen from the data, and the quantity of interest is $R(g)$, where $g$ is drawn from the distribution $\rho$. This risk is thus doubly stochastic: it depends on the realization of the training set (which is a realization of the $n$-fold product distribution $P^{\otimes n}$ of $P$) and on the realization of the (posterior) distribution $\rho$.

Basically, one can argue that the difference between the approaches seems minor: the understanding of $\mathbb{E}_{g\sim\rho} R(g)$ for any distribution $\rho$ implies the understanding of $R(\hat g)$ (simply by considering the Dirac distribution at $\hat g$), and the converse is also true (to the extent that if $R(\hat g) \le B(\hat g)$ holds for any estimator $\hat g$ and some real-valued function $B$, then $\mathbb{E}_{g\sim\rho} R(g) \le \mathbb{E}_{g\sim\rho} B(g)$ also holds for any posterior distribution $\rho$).

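The main difference lies rather in the very starting point of the PAC-Bayesian analysis. To detail it, let me introduce a distribution $\pi \in \mathcal{M}$ that is non-random (as opposed to $\rho$, which depends on the sample). The central argument is based on the following property of the Kullback-Leibler (KL) divergence: for any bounded function $h : \mathcal{G}(\mathcal{X};\mathcal{Y}) \to \mathbb{R}$, we have

\[ \sup_{\rho \in \mathcal{M}} \bigl\{ \mathbb{E}_{g\sim\rho} h(g) - K(\rho, \pi) \bigr\} = \log \mathbb{E}_{g\sim\pi} e^{h(g)}, \tag{2.1.1} \]

where $e$ denotes the exponential number, and $K(\rho, \pi)$ is the KL divergence between the distributions $\rho$ and $\pi$: $K(\rho, \pi) = \mathbb{E}_{g\sim\rho} \log\bigl(\frac{\rho}{\pi}(g)\bigr)$ if $\rho$ admits a density with respect to $\pi$, denoted $\frac{\rho}{\pi}$, and $K(\rho, \pi) = +\infty$ otherwise¹. To control the difference $\mathbb{E}_{g\sim\rho} R(g) - \mathbb{E}_{g\sim\rho} r(g)$, putting aside integrability issues, one essentially uses: for any $\lambda > 0$,

\[
\begin{aligned}
\mathbb{E}_{Z_1^n \sim P^{\otimes n}}\, e^{\sup_{\rho\in\mathcal{M}} \{ \lambda [\mathbb{E}_{g\sim\rho} R(g) - \mathbb{E}_{g\sim\rho} r(g)] - K(\rho,\pi) \}}
 &= \mathbb{E}_{Z_1^n \sim P^{\otimes n}}\, \mathbb{E}_{g\sim\pi}\, e^{\lambda [R(g) - r(g)]} \\
 &= \mathbb{E}_{g\sim\pi}\, \mathbb{E}_{Z_1^n \sim P^{\otimes n}}\, e^{\lambda [R(g) - r(g)]} \\
 &= \mathbb{E}_{g\sim\pi} \Bigl[ \mathbb{E}_{(X,Y)\sim P}\, e^{\frac{\lambda}{n} [R(g) - \ell(Y, g(X))]} \Bigr]^n .
\end{aligned}
\tag{2.1.2}
\]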
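The duality formula (2.1.1) can be checked numerically when the prior is supported on finitely many functions. The sketch below is only an illustration under that assumption: the prior `pi`, the function `h` and all helper names are made up for the example. It verifies that the supremum is attained at the distribution proportional to $\pi(g) e^{h(g)}$ and that randomly drawn posteriors never exceed it.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6                                  # number of functions in the finite set
pi = rng.dirichlet(np.ones(m))         # hypothetical prior weights (all positive)
h = rng.normal(size=m)                 # hypothetical bounded function h

def kl(rho, prior):
    """KL divergence between two distributions on a finite set."""
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / prior[mask])))

def objective(rho):
    """Left-hand side of (2.1.1) before taking the supremum."""
    return float(rho @ h) - kl(rho, pi)

log_laplace = float(np.log(pi @ np.exp(h)))   # right-hand side of (2.1.1)

gibbs = pi * np.exp(h)
gibbs /= gibbs.sum()                           # candidate maximizer
gap_at_gibbs = abs(objective(gibbs) - log_laplace)

best_random = max(objective(rng.dirichlet(np.ones(m))) for _ in range(1000))
```

Here `gap_at_gibbs` should vanish up to rounding errors, and `best_random` should stay below `log_laplace`.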



A first consequence is that PAC-Bayes bounds are not (directly) useful for posterior distributions with $K(\rho, \pi) = +\infty$: this is in particular the case when $\rho$ is a Dirac distribution and $\pi$ assigns no probability mass to single functions. So classical results of the standard approach do not derive from the PAC-Bayesian approach. On the other hand, the appearance of the KL term shows that the PAC-Bayesian analysis fundamentally differs from the simple analysis given in the previous paragraph.

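To illustrate this last point, consider the case of a prior distribution putting mass on a finite set $\mathcal{G} \subset \mathcal{G}(\mathcal{X};\mathcal{Y})$ of functions. For simplicity, consider bounded losses, say $0 \le \ell(y, y') \le 1$ for any $y, y' \in \mathcal{Y}$. By using Hoeffding's inequality and a weighted union bound, one gets that for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, we have for any $g \in \mathcal{G}$

\[ R(g) - r(g) \le \sqrt{\frac{\log(\pi^{-1}(g)) + \log(\varepsilon^{-1})}{2n}}, \]

hence for any distribution $\rho$ such that $\rho(\mathcal{G}) = 1$,

\[ \mathbb{E}_{g\sim\rho} R(g) - \mathbb{E}_{g\sim\rho} r(g) \le \mathbb{E}_{g\sim\rho} \sqrt{\frac{\log(\pi^{-1}(g)) + \log(\varepsilon^{-1})}{2n}} \le \sqrt{\frac{K(\rho, \pi) + H(\rho) + \log(\varepsilon^{-1})}{2n}}, \tag{2.1.3} \]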



¹ See Appendix A for a summary of the properties of the KL divergence.






where the second inequality uses Jensen's inequality and Shannon's entropy: $H(\rho) = -\sum_{g \in \mathcal{G}} \rho(g) \log \rho(g)$. This is to be compared to the first PAC-Bayesian bound from the pioneering work of McAllester [102], which states that with probability at least $1 - \varepsilon$, for any distribution $\rho \in \mathcal{M}$, we have

\[ \mathbb{E}_{g\sim\rho} R(g) - \mathbb{E}_{g\sim\rho} r(g) \le \sqrt{\frac{K(\rho, \pi) + \log(2n) + \log(\varepsilon^{-1})}{2n - 1}}. \]

The main difference is that the Shannon entropy has been replaced with a $\log n$ term. In fact, the latter bound is not restricted to prior distributions putting mass on a finite set of functions: it is valid for any distribution $\pi$. On the contrary, the basic argument leading to (2.1.3) does not extend to continuous sets of functions because of the Shannon entropy term (for $\rho$ putting mass on a continuous set of functions, this term diverges).
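To make the comparison concrete, the following Python sketch evaluates the right-hand side of the union-bound inequality (2.1.3) and of McAllester's bound for a finite set of functions. It is only an illustration: the function names are hypothetical, and the formulas are transcribed from the two bounds as stated above.

```python
import math

def kl_finite(rho, pi):
    """KL divergence between two distributions on a finite set."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)

def entropy(rho):
    """Shannon entropy H(rho) = -sum rho(g) log rho(g)."""
    return -sum(r * math.log(r) for r in rho if r > 0)

def union_bound_gap(pi_g, n, eps):
    """Hoeffding + weighted union bound for one function of prior mass pi_g."""
    return math.sqrt((math.log(1.0 / pi_g) + math.log(1.0 / eps)) / (2 * n))

def finite_class_bound(rho, pi, n, eps):
    """Right-hand side of (2.1.3)."""
    return math.sqrt((kl_finite(rho, pi) + entropy(rho) + math.log(1.0 / eps)) / (2 * n))

def mcallester_bound(rho, pi, n, eps):
    """Right-hand side of McAllester's bound."""
    return math.sqrt((kl_finite(rho, pi) + math.log(2 * n) + math.log(1.0 / eps)) / (2 * n - 1))

# Hypothetical example: uniform prior on four functions, Dirac posterior.
pi = [0.25, 0.25, 0.25, 0.25]
dirac = [1.0, 0.0, 0.0, 0.0]
n, eps = 1000, 0.05
b_union = finite_class_bound(dirac, pi, n, eps)
b_mca = mcallester_bound(dirac, pi, n, eps)
```

For a Dirac posterior the entropy vanishes and (2.1.3) reduces to the single-function union bound, which is smaller than McAllester's bound here. The advantage of the latter is that it remains meaningful when the entropy term is not finite.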

The previous discussion has shown the originality of the PAC-Bayesian analysis. However, it does not clearly demonstrate its usefulness. Several works in the last decade have shown that the approach is indeed useful, and that PAC-Bayesian bounds lead to tight bounds, which are often representative of the risk behaviour even for relatively small training sets (see e.g. [88, 103, 82] for margin-based bounds from Gaussian prior distributions, [83] for an Adaboost setting, that is majority vote of weak learners, [118] in a clustering setting, [?, Chap. 2], [89] for compression schemes, [50, ?] for PAC bounds with sparsity-inducing prior distributions).

My contributions to the PAC-Bayesian approach are the use of relative PAC-Bayesian bounds to design estimators with minimax rates (Section 2.3), the combination of the PAC-Bayesian argument with metric and (generic) chaining arguments (Section 2.4), and the use of PAC-Bayesian bounds to propose new estimators and minimax bounds under weak assumptions for the aggregation problems (Chapter 3). Before detailing them, I give in the next section a global picture of PAC-Bayesian bounds, with a particular emphasis on the relations between the different works, since these relations have not been underlined so far in the literature.

2.2. PAC-Bayesian bounds 

We consider that the losses are between 0 and 1, unless otherwise stated. The symbol $C$ will be used to denote a constant that may differ from line to line. The bounds stated here are the original ones, possibly up to minor improvements. Most of them rely on a different use of the duality formula (2.1.1) and the Markov inequality, which allows one to prove a Probably Approximately Correct (PAC) bound from the control of the Laplace transform of an appropriate random variable: precisely, if a real-valued random variable $V$ is such that $\mathbb{E} e^{V} \le 1$, then for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, $V \le \log(\varepsilon^{-1})$.
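For completeness, the Markov step behind this last implication can be written out as follows (a short worked derivation, assuming only $\mathbb{E} e^{V} \le 1$):

```latex
\begin{align*}
\mathbb{P}\bigl( V > \log(\varepsilon^{-1}) \bigr)
  &= \mathbb{P}\bigl( e^{V} > \varepsilon^{-1} \bigr) \\
  &\le \varepsilon \, \mathbb{E}\, e^{V}
     && \text{(Markov's inequality)} \\
  &\le \varepsilon .
\end{align*}
```

Applied to $V = \sup_{\rho} \{ \lambda [\mathbb{E}_{g\sim\rho} R(g) - \mathbb{E}_{g\sim\rho} r(g)] - K(\rho,\pi) \} - c$, with $c$ the logarithm of the corresponding Laplace transform, this turns a moment bound into a statement holding uniformly over posterior distributions.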

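2.2.1. Non localized PAC-Bayesian bounds. McAllester's first bound states that for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, for any $\rho \in \mathcal{M}$, we have

\[ \mathbb{E}_{g\sim\rho} R(g) - \mathbb{E}_{g\sim\rho} r(g) \le \sqrt{\frac{K(\rho, \pi) + \log(2n) + \log(\varepsilon^{-1})}{2n - 1}}. \tag{McA} \]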

In [87, 117], Seeger has proposed a simplified proof and improved the bound when the losses take only two values, 0 or 1 (classification losses). The result is that with probability at least $1 - \varepsilon$, for any $\rho \in \mathcal{M}$, we have

\[ K\bigl( \mathbb{E}_{g\sim\rho} r(g), \mathbb{E}_{g\sim\rho} R(g) \bigr) \le \frac{K(\rho, \pi) + \log(2\sqrt{n}\, \varepsilon^{-1})}{n}, \tag{S} \]

where, with a slight abuse of notation, $K(\mathbb{E}_{g\sim\rho} r(g), \mathbb{E}_{g\sim\rho} R(g))$ denotes the KL divergence between the Bernoulli distributions of respective parameters $\mathbb{E}_{g\sim\rho} r(g)$ and $\mathbb{E}_{g\sim\rho} R(g)$. The concise proofs of (McA) and (S) are given in Appendices B and C.



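Since we have $\mathbb{E}_{g\sim\rho} R(g) - \mathbb{E}_{g\sim\rho} r(g) \le \sqrt{\tfrac{1}{2} K\bigl( \mathbb{E}_{g\sim\rho} r(g), \mathbb{E}_{g\sim\rho} R(g) \bigr)}$ (Pinsker's inequality), (S) implies (McA). Besides, when $\mathbb{E}_{g\sim\rho} R(g)$ is small, (S) provides a much better bound than (McA) since, from a cumbersome study of the function $t \mapsto K(\mathbb{E}_{g\sim\rho} r(g), \mathbb{E}_{g\sim\rho} r(g) + t)$, (S) implies

\[ \bigl| \mathbb{E}_{g\sim\rho} R(g) - \mathbb{E}_{g\sim\rho} r(g) \bigr| \le \sqrt{\frac{2\, \mathbb{E}_{g\sim\rho} r(g) \bigl[ 1 - \mathbb{E}_{g\sim\rho} r(g) \bigr] \mathcal{K}}{n}} + \frac{4 \mathcal{K}}{3n}, \tag{S'} \]

with $\mathcal{K} = K(\rho, \pi) + \log(2\sqrt{n}\, \varepsilon^{-1})$. In particular, when the empirical risk of the randomized estimator is zero, this last bound is of order $1/n$, while (McA) only gives an order of $1/\sqrt{n}$.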
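In practice, a bound of the form (S) is turned into an upper bound on the risk by inverting the Bernoulli KL divergence numerically. The sketch below does this by bisection; it is an illustration under the assumption that (S) has the form $K(\mathbb{E} r, \mathbb{E} R) \le [K(\rho,\pi) + \log(2\sqrt{n}\,\varepsilon^{-1})]/n$, and all function names are hypothetical.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    tiny = 1e-12
    q = min(max(q, tiny), 1.0 - tiny)   # avoid log(0)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return out

def kl_upper_inverse(r_emp, budget, tol=1e-10):
    """Largest q >= r_emp such that kl(r_emp, q) <= budget (bisection)."""
    lo, hi = r_emp, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(r_emp, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def seeger_risk_bound(r_emp, kl_post_prior, n, eps):
    """Upper bound on the expected risk deduced from a bound of type (S)."""
    budget = (kl_post_prior + math.log(2.0 * math.sqrt(n) / eps)) / n
    return kl_upper_inverse(r_emp, budget)
```

For a zero empirical risk the inversion is explicit, since $K(0, q) = -\log(1 - q)$. This gives a bound of order $1/n$, in line with the remark above.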

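Still in the classification setting, Catoni [40] proposed a different bound: for any $\varepsilon > 0$ and $\lambda > 0$ with $\frac{\lambda}{n} \Psi\bigl(\frac{\lambda}{n}\bigr) < 1$, with probability at least $1 - \varepsilon$, for any $\rho \in \mathcal{M}$,

\[ \mathbb{E}_{g\sim\rho} R(g) \le \frac{\mathbb{E}_{g\sim\rho} r(g)}{1 - \frac{\lambda}{n} \Psi\bigl(\frac{\lambda}{n}\bigr)} + \frac{K(\rho, \pi) + \log(\varepsilon^{-1})}{\lambda \bigl[ 1 - \frac{\lambda}{n} \Psi\bigl(\frac{\lambda}{n}\bigr) \bigr]}, \tag{C1} \]

where

\[ \Psi(t) = \frac{e^{t} - 1 - t}{t^{2}}. \]

Since typical values of $\lambda$ (the ones which minimize the previous right-hand side) are in $[C\sqrt{n}; Cn]$ and since $\Psi(\lambda/n) \approx 1/2$ for $\lambda/n$ close to 0, we roughly have

\[ \mathbb{E}_{g\sim\rho} R(g) \lesssim \mathbb{E}_{g\sim\rho} r(g) + \frac{\lambda}{2n} \mathbb{E}_{g\sim\rho} r(g) + \frac{K(\rho, \pi) + \log(\varepsilon^{-1})}{\lambda}, \]

which gives by choosing $\lambda$ optimally²

\[ \mathbb{E}_{g\sim\rho} R(g) \lesssim \mathbb{E}_{g\sim\rho} r(g) + \sqrt{\frac{2\, \mathbb{E}_{g\sim\rho} r(g) \bigl[ K(\rho, \pi) + \log(\varepsilon^{-1}) \bigr]}{n}}. \tag{C1'} \]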
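The passage from a bound with a free parameter $\lambda$ to a square-root bound is plain calculus: an expression of the form $a + \frac{\lambda}{2n} a + \frac{c}{\lambda}$ is minimized at $\lambda^* = \sqrt{2nc/a}$, where it equals $a + \sqrt{2ac/n}$. The Python sketch below checks this numerically, together with the limit $\Psi(t) \to 1/2$ as $t \to 0$. The names `psi`, `rough_bound` and the numerical values are assumptions made for illustration.

```python
import math

def psi(t):
    """Psi(t) = (exp(t) - 1 - t) / t**2, extended by continuity at t = 0."""
    if abs(t) < 1e-4:
        # Taylor expansion, used to avoid cancellation near zero
        return 0.5 + t / 6.0 + t * t / 24.0
    return (math.expm1(t) - t) / (t * t)

def rough_bound(a, c, n, lam):
    """First-order form a + lam * a / (2n) + c / lam, with Psi replaced by 1/2."""
    return a + lam * a / (2.0 * n) + c / lam

# Hypothetical values: a plays the role of the empirical risk term,
# c the role of the complexity term K(rho, pi) + log(1/eps).
a, c, n = 0.1, 8.0, 1000
lam_star = math.sqrt(2.0 * n * c / a)
optimized = rough_bound(a, c, n, lam_star)
closed_form = a + math.sqrt(2.0 * a * c / n)
```

With these values, `lam_star` lies between $\sqrt{n}$ and $n$, consistent with the typical range of $\lambda$ mentioned above.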

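My PhD thesis used in various ways the following Bernstein-type PAC-Bayesian bound, which is a direct extension of the argument giving (C1): for any $\lambda > 0$, with probability at least $1 - \varepsilon$, for any $\rho \in \mathcal{M}$,

\[ \mathbb{E}_{g\sim\rho} R(g) \le \mathbb{E}_{g\sim\rho} r(g) + \frac{\lambda}{n} \Psi\Bigl(\frac{\lambda}{n}\Bigr) \mathbb{E}_{g\sim\rho} \operatorname{Var}_Z\, \ell(Y, g(X)) + \frac{K(\rho, \pi) + \log(\varepsilon^{-1})}{\lambda}. \tag{A} \]

The basic PAC-Bayesian bound used in Zhang's works [136, 137] does not require any boundedness assumption on the loss function and states that for any $\lambda > 0$, with probability at least $1 - \varepsilon$, for any $\rho \in \mathcal{M}$,

\[ -\frac{n}{\lambda} \mathbb{E}_{g\sim\rho} \log \mathbb{E}_Z\, e^{-\frac{\lambda}{n} \ell(Y, g(X))} \le \mathbb{E}_{g\sim\rho} r(g) + \frac{K(\rho, \pi) + \log(\varepsilon^{-1})}{\lambda}. \tag{Z} \]

Catoni's book [41] concentrates on the classification task. Instead of using

\[ \log \mathbb{E}\, e^{-\frac{\lambda}{n} \ell(Y, g(X))} \le -\frac{\lambda}{n} R(g) + \frac{\lambda^2}{n^2} \Psi\Bigl(\frac{\lambda}{n}\Bigr) R(g), \]

which would give (C1) from (Z), Catoni used the equality

\[ \log \mathbb{E}\, e^{-\frac{\lambda}{n} \ell(Y, g(X))} = \log\bigl( 1 - R(g) (1 - e^{-\frac{\lambda}{n}}) \bigr), \]

and obtained that with probability at least $1 - \varepsilon$, for any $\rho \in \mathcal{M}$,

\[ -\frac{n}{\lambda} \log\bigl[ 1 - (1 - e^{-\frac{\lambda}{n}}) \mathbb{E}_{g\sim\rho} R(g) \bigr] \le \mathbb{E}_{g\sim\rho} r(g) + \frac{K(\rho, \pi) + \log(\varepsilon^{-1})}{\lambda}. \tag{C2} \]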

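To compare Seeger's bound with the bounds having the free parameter $\lambda$ in the classification framework, one needs to apply the same kind of analysis which leads from (C1) to (C1'). As a result, both the Bernstein-type bound (A) and (Z) lead to

\[ \mathbb{E}_{g\sim\rho} R(g) \lesssim \mathbb{E}_{g\sim\rho} r(g) + \sqrt{\frac{2\, \mathbb{E}_{g\sim\rho} \bigl\{ R(g) [1 - R(g)] \bigr\} \bigl[ K(\rho, \pi) + \log(\varepsilon^{-1}) \bigr]}{n}}, \]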



² Technically speaking, we are not allowed to choose $\lambda$ depending on $\rho$, but using a union bound argument, the argument can be made rigorous at the price that the $\log(\varepsilon^{-1})$ term becomes $\log(C \log(Cn)\, \varepsilon^{-1})$.
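(C1) leads to

\[ \mathbb{E}_{g\sim\rho} R(g) \lesssim \mathbb{E}_{g\sim\rho} r(g) + \sqrt{\frac{2\, \mathbb{E}_{g\sim\rho} R(g) \bigl[ K(\rho, \pi) + \log(\varepsilon^{-1}) \bigr]}{n}}, \]

(S) gives, once more from studying the function $t \mapsto K(\mathbb{E}_{g\sim\rho} r(g), \mathbb{E}_{g\sim\rho} r(g) + t)$,

\[ \mathbb{E}_{g\sim\rho} R(g) \le \mathbb{E}_{g\sim\rho} r(g) + \sqrt{\frac{2\, \mathbb{E}_{g\sim\rho} R(g) \bigl[ 1 - \mathbb{E}_{g\sim\rho} R(g) \bigr] \mathcal{K}}{n}} + \frac{2 \mathcal{K}}{3n}, \]

with $\mathcal{K} = K(\rho, \pi) + \log(2\sqrt{n}\, \varepsilon^{-1})$, and finally (C2) leads to

\[ \mathbb{E}_{g\sim\rho} R(g) \lesssim \mathbb{E}_{g\sim\rho} r(g) + \sqrt{\frac{2 \bigl( \mathbb{E}_{g\sim\rho} R(g) \bigl[ 1 - \mathbb{E}_{g\sim\rho} R(g) \bigr] \bigr) \bigl[ K(\rho, \pi) + \log(\varepsilon^{-1}) \bigr]}{n}}. \]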

Although we have $\mathbb{E}_{g\sim\rho} R(g) [1 - \mathbb{E}_{g\sim\rho} R(g)] \ge \mathbb{E}_{g\sim\rho} \{ R(g) [1 - R(g)] \}$ (from Jensen's inequality), the two quantities will be of the same order, and also of the order of $\mathbb{E}_{g\sim\rho} R(g)$, for the typical posterior distributions, i.e., the ones which concentrate on low risk functions. As a consequence, in the classification setting, all these bounds are similar (even if this similarity has not been exhibited so far in the literature).

In fact, the works which lead to (C1), (A), (Z) and (C2) rather differ in the way these bounds are refined and used. The main common refinement is the PAC-Bayesian localization, which can be seen as a way to reduce the complexity term and the influence of the particular choice of the prior distribution $\pi$. Before detailing the localization idea, let us see how to design an estimator from PAC-Bayesian bounds.

2.2.2. From PAC-Bayesian bounds to estimators. The standard way to exploit an upper bound on the risk of any estimator is to minimize it in order to get the estimator having the best guarantee in view of the bound. This will be achievable if the bound is empirical, that is computable from the observations. Bounds (McA), (S), (C1), (Z) and (C2) are of this type (unlike the Bernstein-type bound (A), for instance).

When minimizing PAC-Bayesian bounds, one gets a posterior distribution corresponding to a randomized estimator. The minimizer can be written in the following form

\[ \pi_h(dg) = \frac{e^{h(g)}}{\mathbb{E}_{g'\sim\pi}\, e^{h(g')}} \cdot \pi(dg), \]

for some appropriate function $h : \mathcal{G} \to \mathbb{R}$. This is essentially due to the equality $\operatorname{argmin}_{\rho \in \mathcal{M}} \bigl\{ -\mathbb{E}_{g\sim\rho} h(g) + K(\rho, \pi) \bigr\} = \pi_h$.

Let us now detail the case of McAllester's bound as it is representative of what can be derived from the other PAC-Bayesian bounds. Let

\[ B(\rho) = \mathbb{E}_{g\sim\rho} r(g) + \sqrt{\frac{K(\rho, \pi) + \log(4n\varepsilon^{-1})}{2n - 1}}. \]

McAllester's bound implies that for any distribution $\rho \in \mathcal{M}$, we have $\mathbb{E}_{g\sim\rho} R(g) \le B(\rho)$. From this, one can deduce that there exists $\hat\lambda \in [\lambda_1, \lambda_2]$ such that $B(\pi_{-\hat\lambda r}) = \min_{\rho} B(\rho)$, with $\lambda_1 = \sqrt{4(2n - 1) \log(4n\varepsilon^{-1})}$ and $\lambda_2 = 2\lambda_1 + 4(2n - 1)$. Besides, the parameter $\hat\lambda$, which can be called the inverse temperature parameter by analogy with the Boltzmann distribution in statistical mechanics, satisfies

\[ \hat\lambda = \sqrt{4(2n - 1) \bigl[ K(\pi_{-\hat\lambda r}, \pi) + \log(4n\varepsilon^{-1}) \bigr]} \]

and

\[ \hat\lambda \in \operatorname{argmin}_{\lambda > 0} \Bigl\{ -\frac{1}{\lambda} \log \mathbb{E}_{g\sim\pi}\, e^{-\lambda r(g)} + \frac{\lambda}{4(2n - 1)} + \frac{\log(4n\varepsilon^{-1})}{\lambda} \Bigr\}. \]

The posterior distribution is thus a distribution which concentrates on low empirical risk functions, but is still a bit diffuse since, to avoid a high KL complexity term, the optimal parameter $\hat\lambda$ cannot be larger than $Cn$. The next section shows how to reduce the complexity term by tuning the prior distribution.
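The minimization over Gibbs posteriors described above can be sketched numerically for a finite set of functions. The code below is an illustration only: the prior, the empirical risks and the grid search are assumptions of this example, and $B(\rho)$ is taken to be $\mathbb{E}_{g\sim\rho} r(g) + \sqrt{[K(\rho,\pi) + \log(4n\varepsilon^{-1})]/(2n-1)}$.

```python
import numpy as np

def gibbs_posterior(prior, emp_risks, lam):
    """Weights of pi_{-lam r}: proportional to prior * exp(-lam * r)."""
    logw = np.log(prior) - lam * emp_risks
    logw -= logw.max()                  # stabilise the exponentials
    w = np.exp(logw)
    return w / w.sum()

def kl(rho, prior):
    """KL divergence between two distributions on a finite set."""
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / prior[mask])))

def mcallester_B(rho, prior, emp_risks, n, eps):
    """Empirical bound B(rho) built from McAllester's bound."""
    comp = (kl(rho, prior) + np.log(4 * n / eps)) / (2 * n - 1)
    return float(rho @ emp_risks + np.sqrt(comp))

# Hypothetical setting: 50 functions, uniform prior, made-up empirical risks.
rng = np.random.default_rng(0)
n, m, eps = 500, 50, 0.05
prior = np.full(m, 1.0 / m)
emp_risks = rng.uniform(0.1, 0.5, size=m)

lam1 = np.sqrt(4 * (2 * n - 1) * np.log(4 * n / eps))
lam2 = 2 * lam1 + 4 * (2 * n - 1)
grid = np.linspace(lam1, lam2, 2000)    # search range [lambda_1, lambda_2]
values = [mcallester_B(gibbs_posterior(prior, emp_risks, lam), prior, emp_risks, n, eps)
          for lam in grid]
best_lam = float(grid[int(np.argmin(values))])
best_B = float(min(values))
```

A grid search is used here for simplicity; a one-dimensional optimizer on the same interval would serve equally well.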

2.2.3. Localized PAC-Bayesian bounds. Without prior knowledge, one may want to choose a prior distribution $\pi$ which is rather "flat". Now for a particular choice of posterior distribution $\rho$, from the equality $\mathbb{E}_{Z_1^n} K(\rho, \pi) = \mathbb{E}_{Z_1^n} K(\rho, \mathbb{E}_{Z_1^n}[\rho]) + K(\mathbb{E}_{Z_1^n}[\rho], \pi)$, the prior distribution (recall that it is not allowed to depend on the training set) which minimizes the expectation of the KL divergence is $\mathbb{E}_{Z_1^n}[\rho]$, where the expectation is taken with respect to the training set distribution³. Now using such a prior distribution does not lead to an empirical bound. To alleviate this issue, and since the typical posterior distributions have the form $\pi_{-\lambda r}$ for some $\lambda > 0$ (as seen in the previous section), one may consider the prior distribution $\pi_{-\beta R}$ for some $\beta > 0$, use the expansion

\[ K(\rho, \pi_{-\beta R}) = K(\rho, \pi) + \beta\, \mathbb{E}_{g\sim\rho} R(g) + \log\bigl( \mathbb{E}_{g\sim\pi}\, e^{-\beta R(g)} \bigr), \]

and obtain an empirical bound by controlling the last non-observable term by its empirical version.
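The expansion of the KL divergence with respect to the localized prior follows directly from the definition of $\pi_{-\beta R}$. A short derivation, assuming $\rho$ admits a density with respect to $\pi$, reads:

```latex
\begin{align*}
K(\rho, \pi_{-\beta R})
  &= \mathbb{E}_{g\sim\rho} \log\Bigl( \frac{d\rho}{d\pi_{-\beta R}}(g) \Bigr) \\
  &= \mathbb{E}_{g\sim\rho} \log\Bigl( \frac{d\rho}{d\pi}(g) \Bigr)
     - \mathbb{E}_{g\sim\rho} \log\Bigl(
         \frac{e^{-\beta R(g)}}{\mathbb{E}_{g'\sim\pi}\, e^{-\beta R(g')}} \Bigr) \\
  &= K(\rho, \pi) + \beta\, \mathbb{E}_{g\sim\rho} R(g)
     + \log \mathbb{E}_{g'\sim\pi}\, e^{-\beta R(g')} .
\end{align*}
```

The last term does not depend on $\rho$ and is non-positive for non-negative losses, which is why localization can only reduce the complexity term once the middle term is moved to the other side of the bound.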

This leads to the following localized PAC-Bayesian bound which was obtained 
by Catoni in for any £ > 0, A > and ^ > such that {^z§^^(^) < 1, with 
probability at least 1 — e, for any p G M, we have 



³ As noted by Catoni, $\mathbb{E}_{Z_1^n} K(\rho, \mathbb{E}_{Z_1^n}[\rho])$ is exactly the mutual information of the random variable $\hat g$ drawn according to the posterior distribution $\rho$ and the training sample $Z_1^n$. This makes a nice connection between the learning rate of a randomized estimator and information theory.

The parameter $\xi$ characterizes the localization. For $\xi = 0$, we recover (C1) (up to a minor difference on the confidence level). For $\xi > 0$, the KL term is (potentially much) smaller when considering the posterior distribution $\pi_{-\gamma r}$ with $\gamma \ge \xi\lambda$.

We use similar ideas in the case of the comparison of the risks of two randomized estimators, as we will see in Section 2.3. Zhang [136, 137] localizes by
using with h{g) = a logE^e"'^^*^^'^*^"^)^ instead of t^-^r. The argument there is 
slightly different and does not lead to empirical bounds on the risk of the randomized estimator with posterior distribution of the form $\pi_{-\lambda r}$. Nevertheless, it was sufficient to prove tight theoretical bounds for this estimator in different contexts: density estimation, classification and least squares regression.

Ambroladze, Parrado-Hernández and Shawe-Taylor [6] proposed a different way to reduce the influence of a "flat" prior distribution. Their localization scheme is based on cutting the training set into two parts and learning from the first part the prior distribution to be used on the second part of the training set. Catoni [41] uses $\pi_{-n \log[1 + (e^{\beta/n} - 1) R]}$ to obtain tighter localized bounds in the classification setting. Alquier [4, 5] uses $\pi_{-\beta R}$ for general unbounded losses with application to regression and density estimation.



2.3. Comparison of the risk of two randomized estimators 



2.3.1. Relative PAC-Bayesian bounds. My PhD thesis (its second chapter) used relative bounds, which compare the risks of two randomized estimators, to design new (randomized) estimators. The rationale behind developing this type of bound is that the fluctuations of $R(g_2) - R(g_1) + r(g_1) - r(g_2)$ can be much smaller than the fluctuations of $R(g_2) - r(g_2)$, and this can lead to significantly tighter bounds. Technically speaking, relative bounds are deduced from standard bounds by replacing $\mathcal{G}$ by $\mathcal{G} \times \mathcal{G}$, taking the loss $\ell(y, (g_1, g_2)(x)) = \ell(y, g_2(x)) - \ell(y, g_1(x))$ (with a slight abuse of notation) and by considering product distributions on $\mathcal{G} \times \mathcal{G}$, i.e. $\rho = \rho_1 \otimes \rho_2$ with $\rho_1$ and $\rho_2$ distributions on $\mathcal{G}(\mathcal{X};\mathcal{Y})$. This standard argument transforms the Bernstein-type bound of Section 2.2.1 into the following assertion holding for losses taking values in $[0, 1]$. For any $\lambda > 0$ and (prior) distributions $\pi_1$ and $\pi_2$ in $\mathcal{M}$, with probability at least $1 - \varepsilon$, for any $\rho_1 \in \mathcal{M}$ and $\rho_2 \in \mathcal{M}$,
Eg^^p^R{g2)-Eg^^p^R{gi) < Eg.^^p^r{g2) - Eg^^f,^r{gi) 

+ Eg,^p,Eg,^p,Ez{[iiY, g,{X)) - i{Y, g2{X))f) 

^ i^(p2,7r2) + i^(pi,7ri)+log(e-^) ^ 
A 

Getting empirical relative bounds calls for controlling the variance term. This is achieved by plugging the following inequality, which holds with probability at least $1 - \varepsilon$, into the previous one



i=l 



Ay ir(p2,vr2) + K(pi,7r0+log(e-i) 



Now, the localization argument described in Section 2.2.3 no longer works, as it would change the left-hand side of (2.3.1) into $(1 + \xi_2) \mathbb{E}_{g_2\sim\rho_2} R(g_2) - (1 - \xi_1) \mathbb{E}_{g_1\sim\rho_1} R(g_1)$ for some positive constants $\xi_1$ and $\xi_2$, and would therefore fail to produce relative bounds. To solve this issue, I proved the following uniform empirical upper bound on the KL divergence with respect to a localized prior: for any $\varepsilon > 0$ and $0 < \lambda \le 0.19\, n$, with probability at least $1 - 2\varepsilon$, for any $\rho \in \mathcal{M}$, we have
+ 21ogE,,^^_,^e'^^''2-^^"-[^(^-^^(^^«-^(^-^^(^>))l' + log(£-^), 

and obtained the following localized empirical PAC-Bayesian relative bound: for any $\lambda > 0$ and $0 < \lambda_1, \lambda_2 \le 0.19\, n$, with probability at least $1 - \varepsilon$,



E, 



R{g2)-Eg^^p^R{gi) < Eg.^^p^r{g2) - Eg^^p^r{gi 



1 " 

i=l 



+ b{X) 



+ logE 
+ logE 



(2.3.2) 



with a(A) = Avl/(A) (1 + A) and 6(A) = f [l + ^^^J {l + I)' 



2.3.2. From the empirical relative bound to the estimator. In view of Section 2.2.2, it is natural to concentrate our effort on Gibbs estimators of the form $\pi_{-\lambda r}$ for $\lambda > 0$. Introduce for any $0 \le j \le \log n$ and $\varepsilon > 0$,

\[ \lambda_j = 0.19 \sqrt{n}\, e^{j/2}, \qquad L = \log\bigl[ 3 \log^2(en)\, \varepsilon^{-1} \bigr], \]

and for any $0 \le i < j \le \log n$ and $\varepsilon > 0$,

1 " 
1=1 

+ 6(A,)[e(z) + e(j) + 2L]. 

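Inequality (2.3.2) implies that with probability at least $1 - \varepsilon$, for any $0 \le i < j \le \log n$, we have

\[ \mathbb{E}_{g_2\sim\pi_{-\lambda_j r}} R(g_2) - \mathbb{E}_{g_1\sim\pi_{-\lambda_i r}} R(g_1) \le \mathbb{E}_{g_2\sim\pi_{-\lambda_j r}} r(g_2) - \mathbb{E}_{g_1\sim\pi_{-\lambda_i r}} r(g_1) + S(i, j). \]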

This led me to consider in Chapter 2 of my PhD thesis the following choice of the temperature/complexity parameter in the classification setting.

Algorithm 1. Let $u(0) = 0$. For any $k \ge 1$, define $\hat\lambda_{k-1} = \lambda_{u(k-1)}$ and $u(k)$ as the smallest integer $j \in\, ]u(k-1); \log n]$ such that

\[ \mathbb{E}_{g_2\sim\pi_{-\lambda_j r}} r(g_2) - \mathbb{E}_{g_1\sim\pi_{-\hat\lambda_{k-1} r}} r(g_1) + S(u(k-1), j) \le 0. \]

Classify using a function drawn according to the posterior distribution associated with the last $u(k)$.
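The index selection of Algorithm 1 can be sketched in a few lines of Python. This is only a schematic rendering under assumptions: the array `gibbs_emp_risk` (holding $\mathbb{E}_{g\sim\pi_{-\lambda_j r}} r(g)$ for each $j$) and the callable `S` (the complexity term $S(i, j)$) are hypothetical inputs that the caller must compute from the data.

```python
def select_indices(gibbs_emp_risk, S, j_max):
    """Return the list [u(0), u(1), ..., u(K)] of selected temperature indices.

    gibbs_emp_risk[j]: expected empirical risk under the Gibbs posterior
                       with inverse temperature lambda_j (supplied by caller).
    S(i, j):           complexity term of the relative bound (supplied by caller).
    j_max:             largest admissible index, e.g. floor(log n).
    """
    u = [0]
    while True:
        prev = u[-1]
        # smallest j > u(k-1) whose relative bound certifies a risk decrease
        nxt = next(
            (j for j in range(prev + 1, j_max + 1)
             if gibbs_emp_risk[j] - gibbs_emp_risk[prev] + S(prev, j) <= 0),
            None,
        )
        if nxt is None:
            return u            # u(K + 1) does not exist: stop
        u.append(nxt)

# Hypothetical toy inputs with a constant complexity term.
toy_risks = [0.5, 0.45, 0.3, 0.29, 0.1]
selected = select_indices(toy_risks, lambda i, j: 0.1, j_max=4)
```

The final classifier would then be drawn from the Gibbs posterior with index `selected[-1]`.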

This algorithm can be viewed in the following way: it "ranks" the estimators in the model by increasing complexity (if we consider that $K(\pi_{-\lambda_j r}, \pi)$ is the complexity of the estimator associated with $\pi_{-\lambda_j r}$), picks the "first" function in this list and takes at each step the function of smallest complexity such that its risk is smaller than the one at the previous step. This is possible since we have empirical relative bounds. Subsequently to this work, different iterative schemes based on empirical relative PAC-Bayesian bounds have been proposed [4, 5, 41]. The interest of the procedure lies in the following theoretical guarantee.

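Theorem 1 The iterative scheme is finite: there exists $K \in \mathbb{N}$ such that $u(K)$ exists but not $u(K + 1)$. With probability at least $1 - \varepsilon$, for any $k \in \{1, \dots, K\}$, we have

\[ \mathbb{E}_{g\sim\pi_{-\hat\lambda_k r}} R(g) \le \mathbb{E}_{g\sim\pi_{-\hat\lambda_{k-1} r}} R(g), \]

and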



R{g) < mm ^ E ^,R{g) + C 



'3 



+ sup < VOg]S\g^r^T,_ ]^g^r^T,_ e 

Aj Q<i<j [ » ' 
To illustrate this last theoretical guarantee, let us consider complexity and margin assumptions similar to the ones used in the pioneering work of Mammen and Tsybakov [97]. To detail these assumptions, let $d$ be the (pseudo-)distance on $\mathcal{G}(\mathcal{X};\mathcal{Y})$ defined by

\[ d(g_1, g_2) = \mathbb{P}[g_1(X) \neq g_2(X)]. \]

Let $\mathcal{G} \subset \mathcal{G}(\mathcal{X};\mathcal{Y})$. For $u > 0$, the set $\mathcal{N} \subset \mathcal{G}(\mathcal{X};\mathcal{Y})$ is called a $u$-covering net of $\mathcal{G}$ if we have $\mathcal{G} = \cup_{g \in \mathcal{N}} \{ g' \in \mathcal{G};\ d(g, g') \le u \}$. Let $H(u)$ denote the $u$-covering entropy, i.e. the logarithm of the cardinality of the smallest $u$-covering net of $\mathcal{G}$. The complexity assumption is that there exist $C' > 0$ and $q > 0$ such that $H(u) \le C' u^{-q}$ for any $u > 0$. Let

\[ g^* = \operatorname{argmin}_{g \in \mathcal{G}} R(g). \]

Without great loss of generality, we assume the existence of such a function. The margin assumption is that there exist $c'', C'' > 0$ and $\kappa \in [1, +\infty]$ such that for any function $g \in \mathcal{G}$,

\[ c'' [R(g) - R(g^*)]^{1/\kappa} \le \mathbb{P}[g(X) \neq g^*(X)] \le C'' [R(g) - R(g^*)]^{1/\kappa}. \tag{2.3.3} \]

For any $k \in \mathbb{N}^*$, introduce $\pi_k$, the uniform distribution on the smallest $2^{-k}$-covering net.

Theorem 2 For the prior distribution $\pi = \sum_{k \ge 1} \frac{\pi_k}{k(k+1)}$, the randomized estimator defined in Algorithm 1 satisfies

^g^^_.^^R{g) - R{g*) < Cn-^^^ , 

for some positive constant C. 

We also proved in [7, Chap. 3, Theorem 3.3] that the right-hand side is the minimax optimal convergence rate under such assumptions. Since the algorithm does not require the knowledge of the margin parameter $\kappa$, it is adaptive to this parameter.

Note that Assumption (2.3.3) is stronger than the usual assumption, as the latter does not assume the left inequality. In fact, achieving minimax optimal rates under the usual margin assumption, while still assuming polynomial covering entropies, requires the chaining argument [7, Chap. 3]. This leads us to study how to combine the chaining argument with the PAC-Bayesian approach and make the connection with majorizing measures from the generic chaining argument developed by Fernique and Talagrand [120], which we detail in the next section.

2.4. Combining PAC-Bayesian and Generic Chaining Bounds 

There exist many different risk bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. In [10], Olivier Bousquet and I underline the links between these bounds, and combine several different improvements into a single bound. In particular, we combine the PAC-Bayes approach with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand, in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. The interest in generic chaining rather than just Dudley's chaining [55] comes from the fact that it captures better the behaviour of the supremum of a Gaussian process [120]. In statistical learning theory, the process of interest, which is asymptotically Gaussian, is $g \mapsto R(g) - r(g)$.

I hereafter give a simplified version of the main results of [10]. Let me first introduce the notation. We still consider a set $\mathcal{G} \subset \mathcal{G}(\mathcal{X};\mathcal{Y})$, $g^* = \operatorname{argmin}_{g \in \mathcal{G}} R(g)$, and losses taking their values in $[0, 1]$. We consider a sequence of nested partitions $(\mathcal{A}_j)_{j \in \mathbb{N}}$ of the set $\mathcal{G}$, that is (i) $\mathcal{A}_j$ is a partition of $\mathcal{G}$, either countable or equal to the set of all singletons of $\mathcal{G}$, and (ii) the $\mathcal{A}_j$ are nested: each element of $\mathcal{A}_{j+1}$ is contained in an element of $\mathcal{A}_j$, and $\mathcal{A}_0 = \{\mathcal{G}\}$. For the partition $\mathcal{A}_j$ and for $g \in \mathcal{G}$, we denote by $\mathcal{A}_j(g)$ the unique element of $\mathcal{A}_j$ containing $g$. Given a sequence of nested partitions $(\mathcal{A}_j)_{j \in \mathbb{N}}$, we can build a collection $(S_j)_{j \in \mathbb{N}}$ of approximating subsets of $\mathcal{G}$ in the following way: for each $j \in \mathbb{N}$, for each element $A$ of $\mathcal{A}_j$, choose a unique element of $\mathcal{G}$ contained in $A$ and define $S_j$ as the set of all chosen elements. We have $|S_0| = 1$ and denote by $p_j(g)$ the unique element of $S_j$ contained in $\mathcal{A}_j(g)$. Finally, we also consider that for each $j \in \mathbb{N}$, we have a distribution $\pi^{(j)}$ on $\mathcal{G}$ at our disposal.

Our bound will depend on the specific choices of the distributions $\pi^{(j)}$, the nested partitions $(\mathcal{A}_j)$, the associated sequence of approximating sets $(S_j)$, and the corresponding approximating functions $p_j(g)$, $g \in \mathcal{G}$. Denote by $\delta_g$ the Dirac measure on $g$. For a probability distribution $\rho$ on $\mathcal{G}$, define its $j$-th projection as

\[ [\rho]_j = \sum_{g \in S_j} \rho[\mathcal{A}_j(g)]\, \delta_g \]

when $S_j$ is countable and $[\rho]_j = \rho$ otherwise. For any $\varepsilon > 0$ and $\rho \in \mathcal{M}$, define the complexity of $\rho$ at scale $j$ by

and introduce the average distance between the (j — l)-th and j-th projections by 

2),(p) =E,.,|^E^.p|£(F,b,((7)](X)) -£(r,b,_i(^?)](X))}' 

1 " 2 1 

+ ^Y.{^i^^^ip^^9)m^)-^{Y^,[p,-,{g)m))} \ 
Theorem 3 If the following condition holds

\[ \lim_{j \to +\infty} \sup_{g \in \mathcal{G}} \bigl\{ R(g) - R[p_j(g)] - r(g) + r[p_j(g)] \bigr\} = 0, \quad \text{a.s.} \tag{2.4.1} \]

then for any < /3 < 0.7, with probability at least 1 — 5, for any p G M, we have 




(2.4.2) 

Assumption (2.4.1) is not very restrictive. For instance, it is satisfied when one of the following conditions holds:

• there exists $J \in \mathbb{N}^*$ such that $S_J = \mathcal{G}$,

• almost surely $\lim_{j \to +\infty} \sup_{g \in \mathcal{G}} \sup_{x \in \mathcal{X}, y \in \mathcal{Y}} \bigl| \ell(y, g(x)) - \ell(y, [p_j(g)](x)) \bigr| = 0$ (it is in particular the case when the bracketing entropy of the set $\mathcal{G}$ is finite for any radius and when the $S_j$'s and $p_j$'s are appropriately built on the bracketing nets of radius going to 0 when $j \to +\infty$).

The bound (2.4.2) combines several previous improvements. It contains an optimal union bound, both in the sense of optimally taking into account the metric structure of the set of functions (via the majorizing measure approach) and in the sense of taking into account the averaging distribution. It is sensitive to the variance of the functions and consequently will lead to fast convergence rates (that is, faster than $1/\sqrt{n}$) under margin assumptions such as the ones considered in the works of Nédélec and Massart [100] or Mammen and Tsybakov [97]. It holds for randomized classifiers but, contrary to usual PAC-Bayesian bounds, it remains finite when the averaging distribution is concentrated at a single prediction function. On the negative side, there still remains work in order to get a fully empirical bound (it is not the case here since $D_j(\rho)$ is not observable) and to better understand the connection with Rademacher averages.

Independently of the generic chaining argument, we use a carefully weighted union bound argument, which is at the origin of the log log term in (2.4.2) and leads to the following corollary of the main result in [10].

Theorem 4 For any $\epsilon > 0$, with probability at least $1 - \epsilon$, for any $\rho \in \mathcal{M}$, we
have

for some numerical constant $C > 0$ [10, Section 4.3].

This result means that neither the $\log(n)$ term in (McA) nor the Shannon
entropy term in (2.1.3) is needed if we accept a slightly larger numerical factor
in front of the square root term.

Chapter 3
The three aggregation problems 



3.1. Introduction 

Aggregation is about combining different prediction functions in order to get a 
better prediction. It has become popular and has been intensively studied these last 
two decades partly thanks to the success of boosting algorithms, and principally 
of the AdaBoost algorithm, introduced by Freund and Schapire [58]. These algorithms
use a linear combination of a large number of simple functions to provide a
classification decision rule.

In this chapter, we focus on the least squares setting, in which the outputs are
real numbers and the risk of a prediction function $g : \mathcal{X} \to \mathbb{R}$ is

$$R(g) = \mathbb{E}[Y - g(X)]^2.$$

Our results are nevertheless also of interest for classification, since any estimate of the
conditional expectation of the output knowing the input leads by thresholding to
a classification decision rule, and the quality of this plug-in estimator is directly
linked to the quality of the least squares regression estimator (see [53, Section 6.2],
[16] and specifically the comparison lemmas of its Section 5, and also [95, 27, 28]
for consistency results in classification using other surrogate loss functions).

Boosting type classification methods usually aggregate simple functions, but 
the aggregation is also of interest when some potentially complicated functions 
are aggregated. More precisely, when facing the data, the statistician has often to 
choose several models which are likely to be relevant for the task. These mod- 
els can be of similar structures (like embedded balls of functional spaces) or on 
the contrary of very different nature (e.g., based on kernels, splines, wavelets or 
on parametric approaches). For each of these models, we assume that we have a
learning scheme which produces a 'good' prediction function in the sense that its
risk is as small as the risk of the best function of the model up to some small additive
term¹. Then the question is to decide on how we use or combine/aggregate
these schemes. One possible answer is to split the data into two groups, use the 
first group to train the prediction function (i.e. compute the estimator) associated 
with each model, and then use the second group to build a prediction function 
which is as good as (i) the best of the previously learnt prediction functions, (ii) 
the best convex combination of these functions or (iii) the best linear combination 
of these functions, in terms of risk, up to some small additive term. The three 
aggregation problems we will focus on in this chapter concern the second part
of this scheme. The idea of mixing (or combining or aggregating) the estimators
originally appears in [110, 71, 132, 133].

¹The learning procedure could differ for each model, or on the contrary, be the same but using
different values of a tuning parameter.

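We hereafter treat the initial estimators as fixed functions, which means that
the results hold conditionally on the data set on which they have been obtained,
this data set being independent of the $n$ input-output observations $Z_1^n$. Specifically,
let $g_1, \dots, g_d$ be $d$ prediction functions, with $d \ge 2$. Introduce

$$g^*_{\mathrm{MS}} \in \operatorname*{argmin}_{g \in \{g_1,\dots,g_d\}} R(g),$$

$$g^*_{\mathrm{C}} \in \operatorname*{argmin}_{g \in \{\sum_{j=1}^{d} \theta_j g_j;\ \theta_1 \ge 0,\dots,\theta_d \ge 0,\ \sum_{j=1}^{d}\theta_j = 1\}} R(g),$$

and

$$g^*_{\mathrm{L}} \in \operatorname*{argmin}_{g \in \{\sum_{j=1}^{d} \theta_j g_j;\ \theta_1 \in \mathbb{R},\dots,\theta_d \in \mathbb{R}\}} R(g).$$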

The model selection aggregation task (MS) is to find an estimator $\hat g$ based on the
observed data for which the excess risk $R(\hat g) - R(g^*_{\mathrm{MS}})$ is guaranteed to be
small. Similarly, the convex (resp. linear) aggregation task (C) (resp. (L)) is to
find an estimator $\hat g$ for which the excess risk $R(\hat g) - R(g^*_{\mathrm{C}})$ (resp. $R(\hat g) - R(g^*_{\mathrm{L}})$)
is guaranteed to be small.

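The minimax optimal rates of aggregation are given in [123] and references
within. Under suitable assumptions, it is shown that there exist estimators $\hat g_{\mathrm{MS}}$, $\hat g_{\mathrm{C}}$
and $\hat g_{\mathrm{L}}$ such that

$$\mathbb{E}R(\hat g_{\mathrm{MS}}) - R(g^*_{\mathrm{MS}}) \le C \min\Big(\frac{\log d}{n}, 1\Big), \qquad (3.1.1)$$

$$\mathbb{E}R(\hat g_{\mathrm{C}}) - R(g^*_{\mathrm{C}}) \le C \min\Big(\sqrt{\frac{1}{n}\log\Big(1 + \frac{d}{\sqrt{n}}\Big)}, \frac{d}{n}, 1\Big),$$

$$\mathbb{E}R(\hat g_{\mathrm{L}}) - R(g^*_{\mathrm{L}}) \le C \min\Big(\frac{d}{n}, 1\Big),$$

where $\hat g_{\mathrm{L}}$ (and for $d \le \sqrt{n}$, $\hat g_{\mathrm{C}}$) require the knowledge of the input distribution. We
recall that $C$ is a positive constant that may differ from line to line. Tsybakov
[123] has shown that these rates cannot be uniformly improved in the following
sense. Let $\sigma > 0$, $L > 0$ and let $\mathcal{P}_{\sigma,L}$ be the set of probability distributions on
$\mathcal{X} \times \mathbb{R}$ such that we almost surely have $Y = g(X) + \xi$, with $\|g\|_\infty \le L$, and $\xi$ a
centered Gaussian random variable independent of $X$ and with variance $\sigma^2$. For
appropriate choices of $g_1, \dots, g_d$, the following lower bounds hold: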



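$$\inf_{\hat g}\ \sup_{P\in\mathcal{P}_{\sigma,L}} \{\mathbb{E}R(\hat g) - R(g^*_{\mathrm{MS}})\} \ge C \min\Big(\frac{\log d}{n}, 1\Big),$$

$$\inf_{\hat g}\ \sup_{P\in\mathcal{P}_{\sigma,L}} \{\mathbb{E}R(\hat g) - R(g^*_{\mathrm{C}})\} \ge C \min\Big(\sqrt{\frac{1}{n}\log\Big(1 + \frac{d}{\sqrt{n}}\Big)}, \frac{d}{n}, 1\Big),$$

$$\inf_{\hat g}\ \sup_{P\in\mathcal{P}_{\sigma,L}} \{\mathbb{E}R(\hat g) - R(g^*_{\mathrm{L}})\} \ge C \min\Big(\frac{d}{n}, 1\Big),$$

where the infimum is taken over all estimators. The three aggregation tasks have
also been studied in the least squares regression with fixed design, where similar
rates are obtained [36, 50, 51].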
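To get a feel for how the three rates compare, the following Python sketch evaluates $\min(\log d / n, 1)$ for (MS), $\min(\sqrt{\log(1 + d/\sqrt{n})/n}, d/n, 1)$ for (C) and $\min(d/n, 1)$ for (L). It is only an illustration: the constant $C$ is arbitrarily set to 1, and the function names `rate_ms`, `rate_c` and `rate_l` are hypothetical, so only orders of magnitude are meaningful.

```python
import math


def rate_ms(d, n):
    """Model selection aggregation rate, with the constant C set to 1."""
    return min(math.log(d) / n, 1.0)


def rate_c(d, n):
    """Convex aggregation rate, with the constant C set to 1."""
    slow = math.sqrt(math.log(1.0 + d / math.sqrt(n)) / n)
    return min(slow, d / n, 1.0)


def rate_l(d, n):
    """Linear aggregation rate, with the constant C set to 1."""
    return min(d / n, 1.0)


if __name__ == "__main__":
    n = 1000
    for d in (5, 50, 5000):
        print(d, rate_ms(d, n), rate_c(d, n), rate_l(d, n))
```

For $d \le \sqrt{n}$ the convex and linear rates coincide (both equal $d/n$), while for large $d$ the convex rate is driven by the square-root term.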

This chapter will provide my contributions to the aggregation problems (in the
random design setting) summarized as follows.

• The expected excess risk $\mathbb{E}R(\hat g) - R(g^*_{\mathrm{MS}})$ of the empirical risk minimizer
on $\{g_1, \dots, g_d\}$ (or its penalized variants) cannot be uniformly smaller than
$C\sqrt{(\log d)/n}$. Since the minimax optimal rate is $(\log d)/n$, this shows that these
estimators are inappropriate for the model selection task (Section 3.2.1).

• Catoni [39] and Yang [131] have independently shown that the optimal rate
$(\log d)/n$ in the model selection problem is achieved for the progressive mixture
rule. In [9], I provide a variant of this estimator coming from the field of
sequential prediction of nonrandom sequences, and called the progressive
indirect mixture rule. It has the benefit of satisfying a tighter excess risk
bound in a bounded setting (outputs in $[-1, 1]$). I also study the case when
the outputs have heavy tails (much thicker than exponential tails), and show
how the noise influences the minimax optimal convergence rate. I also provide
refined lower bounds of Assouad's type with tight constants (Section
3.2.2).

• In [8], I show a limitation of the algorithms known to satisfy (3.1.1): despite
having an expected excess risk of order $1/n$ (if we drop the dependence in
$d$), the excess risk of the progressive (indirect or not) mixture rule suffers
deviations of order $1/\sqrt{n}$ (Section 3.2.3).

• This last result leads me to define a new estimator $\hat g$ which does not suffer
from this drawback: the deviations of the excess risk $R(\hat g) - R(g^*_{\mathrm{MS}})$ are of
order $(\log d)/n$ (Section 3.2.4).

• In my PhD (its first chapter), I provide an estimator $\hat g$ based on empirical
bounds of any aggregation procedures for which with high probability

$$R(\hat g) - R(g^*_{\mathrm{C}}) \le \begin{cases} C\sqrt{\dfrac{\log(d\log(2n))}{n}} & \text{always}, \\[2mm] C\,\dfrac{\log(d\log(2n))}{n} & \text{if } R(g^*_{\mathrm{MS}}) = R(g^*_{\mathrm{C}}). \end{cases}$$

This means that for $n^{1/2+\delta} \le d \le e^{n}$ with $\delta > 0$, the estimator has the
minimax optimal rate of task (C), and is adaptive to the extent that it also has
the minimax optimal rate of task (MS) when $R(g^*_{\mathrm{MS}}) = R(g^*_{\mathrm{C}})$ (Section
3.3).

• Finally, Olivier Catoni and I provide minimax results for (L), and consequently
also for (C) when $d \le \sqrt{n}$. The strong point of these results
is that they do not require the knowledge of the input distribution, nor uniformly
bounded exponential moments of the conditional distribution of the
output knowing the input, and have no extra logarithmic factor unlike previous
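results. In particular, provided that we know $H$ and $\sigma$ such that $\|g^*_{\mathrm{L}}\|_\infty \le H$
and $\sup_{x\in\mathcal{X}} \mathbb{E}\{[Y - g^*_{\mathrm{L}}(X)]^2 \mid X = x\} \le \sigma^2$, we propose an estimator $\hat g$
satisfying $\mathbb{E}R(\hat g) - R(g^*_{\mathrm{L}}) \le 68(\sigma + H)^2 \frac{d}{n}$ (Section 3.4).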

I should conclude this introductory section by emphasizing that we will not
assume that the regression function $g^{(\mathrm{reg})} : x \mapsto \mathbb{E}(Y \mid X = x)$, which minimizes
the risk functional, is in the linear span of $\{g_1, \dots, g_d\}$. This means that bounds
of the form

ER{g) - R{g*) < c[R{g^''°^^) - R{g*)] + residual term, (3.1.2) 

with c > 1 are not of interest in our setting as they would not provide the mini- 
max learning rate when R(g^'^'^^^) ^ R{g*)- 

3.2. Model selection type aggregation 

3.2.1. SUBOPTIMALITY OF EMPIRICAL RISK MINIMIZATION. Any empirical 
risk minimizer and any of its penalized variants are really poor algorithms in 
this task since their expected convergence rate cannot be uniformly faster than
$\sqrt{(\log d)/n}$. The following lower bound comes from [8] (see [92], [39, p. 14],
[90, 72, 106] for similar results and variants).

Theorem 5 For any training set size $n$, there exist $d$ prediction functions $g_1, \dots, g_d$
taking their values in $[-1, 1]$ such that for any learning algorithm $\hat g$ producing a
prediction function in $\{g_1, \dots, g_d\}$ there exists a probability distribution generating
the data for which $Y \in [-1,1]$ almost surely, and

$$\mathbb{E}R(\hat g) - R(g^*_{\mathrm{MS}}) \ge \min\Big(\sqrt{\frac{\lfloor \log_2 d \rfloor}{4n}}, 1\Big),$$

where $\lfloor \log_2 d \rfloor$ denotes the largest integer smaller than or equal to the logarithm in
base 2 of $d$.

²These last bounds, which are relatively common in the literature, are nonetheless useful in a
nonparametric setting in which the statistician is allowed to take $\{g_1, \dots, g_d\}$ large enough so that
the risk difference in the right-hand side of (3.1.2) is of the same order as the residual term.

3.2.2. Progressive indirect mixture rules. The result of the previous
section shows that, to obtain the minimax optimal rate given in (3.1.1), an estimator
has to look at an enlarged set of prediction functions. Until our work, the only
known optimal estimator was based on a Cesàro mean of Bayesian estimators,
also referred to as progressive mixture rule.

To define it, let $\pi$ be the uniform distribution on the finite set $\{g_1, \dots, g_d\}$. For
any $i \in \{0, \dots, n\}$, the cumulative loss suffered by the prediction function $g$ on
the first $i$ pairs of input-output, denoted $Z_1^i$ for short, is

$$\Sigma_i(g) = \sum_{k=1}^{i} [Y_k - g(X_k)]^2,$$

where by convention we take $\Sigma_0$ identically equal to zero. Let $\lambda > 0$ be a parameter
of the estimator. Recall that $\pi_{-\lambda\Sigma_i}$ is the distribution on $\{g_1, \dots, g_d\}$ admitting
a density with respect to $\pi$ that is proportional to $e^{-\lambda\Sigma_i}$.

The progressive mixture rule (PM) predicts according to $\frac{1}{n+1}\sum_{i=0}^{n} \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}}\, g$.
In other words, for a new input $x$, the predicted output is

$$\frac{1}{n+1}\sum_{i=0}^{n} \frac{\sum_{j=1}^{d} g_j(x)\, e^{-\lambda\Sigma_i(g_j)}}{\sum_{j=1}^{d} e^{-\lambda\Sigma_i(g_j)}}.$$

A specificity of PM is that its proof of optimality is not achieved by the most
prominent tool in statistical learning theory: bounds on the supremum of empirical
processes (see [125], and refined works such as [26, 80, 99, 34] and references within).
The idea of the proof, which goes back to Barron [24], is based on a chain rule
and appeared to be successful for least squares and entropy losses [38, 39, 25, 131]
and for general loss in [72].
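The progressive mixture rule defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: the reference functions are passed as a list of callables, the prior is uniform, the data is a list of `(x, y)` pairs, and the names `gibbs_weights` and `progressive_mixture` are hypothetical.

```python
import math


def gibbs_weights(cum_losses, lam):
    """Weights proportional to exp(-lam * cumulative loss) under a uniform prior."""
    shift = min(cum_losses)  # subtracting the minimum avoids underflow
    raw = [math.exp(-lam * (c - shift)) for c in cum_losses]
    total = sum(raw)
    return [w / total for w in raw]


def progressive_mixture(funcs, data, lam):
    """Return (predictor, averaged_weights) for the progressive mixture rule.

    The predictor is x -> (1/(n+1)) * sum_{i=0}^{n} E_{g ~ pi_{-lam Sigma_i}} g(x),
    which is a convex combination of the reference functions.
    """
    n, d = len(data), len(funcs)
    cum = [0.0] * d   # Sigma_0 is identically zero
    avg = [0.0] * d   # Cesaro average of the Gibbs weights over i = 0..n
    for i in range(n + 1):
        w = gibbs_weights(cum, lam)
        avg = [a + wj / (n + 1) for a, wj in zip(avg, w)]
        if i < n:  # update the cumulative quadratic losses with the (i+1)-th pair
            x, y = data[i]
            cum = [c + (y - g(x)) ** 2 for c, g in zip(cum, funcs)]

    def predictor(x):
        return sum(a * g(x) for a, g in zip(avg, funcs))

    return predictor, avg
```

Averaging the Gibbs weights over $i = 0, \dots, n$, rather than using only the last ones, is what distinguishes the progressive mixture from a plain exponentially weighted aggregate.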

Here my first contribution was to take ideas coming from the field of sequential
prediction of nonrandom sequences (see e.g. [107, 46] for a general overview and
[65, 44, 45, 134] for more specific results with sharp constants) and propose a
slight generalization of progressive mixture rules, that I called progressive indirect
mixture rules.

The progressive indirect mixture rule (PIM) is also parameterized by a real
number $\lambda > 0$, and is defined as follows. For any $i \in \{0, \dots, n\}$, let $\hat h_i$ be a
prediction function such that

$$[Y - \hat h_i(X)]^2 \le -\frac{1}{\lambda} \log \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}}\, e^{-\lambda[Y - g(X)]^2} \quad \text{a.s.} \qquad (3.2.1)$$

If one of the $\hat h_i$ does not exist, the algorithm is said to fail. Otherwise it predicts
according to $\frac{1}{n+1}\sum_{i=0}^{n} \hat h_i$.

This estimator is a direct transposition from the sequential prediction algorithm
proposed and studied in [126, 65, 127] to our "batch" setting. The functions
$\hat h_i$ do not necessarily exist, but are also not necessarily unique when they exist.
A technical justification of (3.2.1) comes from the analysis of PM synthetically
written in Appendix [Dl 

When $\max(|Y|, |g_1(X)|, \dots, |g_d(X)|) \le B$ a.s. for some $B > 0$ and for $\lambda$
small enough, the functions $\hat h_i$ exist (so the algorithm does not fail). Still in this
uniformly bounded setting, it can be shown that PM is a PIM for $\lambda$ small enough
(with the density $e^{-\lambda\Sigma_i}$ used here, these are the conditions $\lambda \le 1/(2B^2)$ and
$\lambda \le 1/(8B^2)$ recalled in Section 3.2.4). On the other hand, there exist larger values
of $\lambda > 0$ for which the algorithm
does not fail and such that PM is not a particular case of PIM, that is one cannot
take $\hat h_i = \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}}\, g$ to satisfy (3.2.1) (see [65, Example 3.13]). In fact, it is
also shown there that PIM will not generally produce a prediction function in
the convex hull of $\{g_1, \dots, g_d\}$ unlike PM. The following amazingly sharp upper
bound on the excess risk of PIM holds.

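Theorem 6 Assume that $|Y| \le 1$ a.s. and $\|g_j\|_\infty \le 1$ for any $j \in \{1, \dots, d\}$.
Then, for $\lambda \le \frac{1}{2}$, PIM does not fail and its expected excess risk is upper bounded
by $\frac{\log d}{\lambda(n+1)}$, that is

$$\mathbb{E}R\Big(\frac{1}{n+1}\sum_{i=0}^{n} \hat h_i\Big) - R(g^*_{\mathrm{MS}}) \le \frac{\log d}{\lambda(n+1)}. \qquad (3.2.2)$$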

It essentially comes from a result in sequential prediction and the fact that 
results expressed in cumulative loss can be transposed to our setting since the 
expected risk of the randomized procedure based on sequential predictions is pro- 
portional to the expectation of the cumulative loss of the sequential procedure. 
Precisely, the following statement holds. 

Lemma 7 Let $\mathcal{A}$ be a learning algorithm which produces the prediction function
$\mathcal{A}(Z_1^i)$ at time $i + 1$, i.e. from the data $Z_1^i = (Z_1, \dots, Z_i)$. Let $\mathcal{L}$ be the randomized
algorithm which produces a prediction function $\mathcal{L}(Z_1^n)$ drawn according to the
uniform distribution on $\{\mathcal{A}(\emptyset), \mathcal{A}(Z_1), \dots, \mathcal{A}(Z_1^n)\}$. The (doubly) expected risk
of $\mathcal{L}$ is equal to $\frac{1}{n+1}$ times the expectation of the cumulative loss of $\mathcal{A}$ on the
sequence $Z_1, \dots, Z_{n+1}$, where $Z_{n+1}$ denotes a random variable independent of
the training set $Z_1^n = (Z_1, \dots, Z_n)$ and with the same distribution $P$.
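The argument behind this lemma fits in one line. As a sketch, write $\ell(g, Z) = [Y - g(X)]^2$ for the loss of $g$ on the pair $Z = (X, Y)$ (this notation is introduced here only for the illustration). Since $Z_{i+1}$ is independent of $Z_1^i$ and has distribution $P$, the risk of $\mathcal{A}(Z_1^i)$ is the conditional expectation of its loss on $Z_{i+1}$, so that:

```latex
\mathbb{E}\, R\big(\mathcal{L}(Z_1^n)\big)
  = \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}\, R\big(\mathcal{A}(Z_1^i)\big)
  = \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}\, \ell\big(\mathcal{A}(Z_1^i), Z_{i+1}\big)
  = \frac{1}{n+1}\, \mathbb{E} \sum_{i=0}^{n} \ell\big(\mathcal{A}(Z_1^i), Z_{i+1}\big).
```

The first equality averages over the uniform draw among the $n + 1$ prediction functions, and the last sum is exactly the cumulative loss of $\mathcal{A}$ on $Z_1, \dots, Z_{n+1}$.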

My second contribution to model selection aggregation in [9] is to provide a
different viewpoint of the progressive mixture rule from the one in [72], leading to
a slight improvement in the moment condition of the initial version of [72]. The
result is the following and is extended to the $L_q$ loss functions for $q \ge 1$ in [9,
Section 7].

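Theorem 8 Assume that $\|g_j\|_\infty \le 1$ for any $j \in \{1, \dots, d\}$, and $\mathbb{E}|Y|^s \le A$ for
some $s \ge 2$ and $A > 0$. For $\lambda = C_1 \big(\frac{\log d}{n}\big)^{2/(s+2)}$ with $C_1 > 0$, the expected excess
risk of PM is upper bounded by $C\big(\frac{\log d}{n}\big)^{s/(s+2)}$, that is

$$\mathbb{E}R\Big(\frac{1}{n+1}\sum_{i=0}^{n} \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}}\, g\Big) - R(g^*_{\mathrm{MS}}) \le C\Big(\frac{\log d}{n}\Big)^{s/(s+2)},$$

for a quantity $C$ which depends only on $C_1$, $A$ and $s$.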

The convergence rate cannot be improved in a minimax sense [9, Section
8.3.2]. These results show how heavy output tails influence the learning rate:
for the limiting case $s = 2$, the bounds are of order $n^{-1/2}$ while for $s$ going to infinity,
it is of order $n^{-1}$, that is the rate in the bounded case, or in the uniformly
bounded conditional exponential moment setting.

The lower bounds developed to prove the minimax optimality of the above
result are based on a refinement of Assouad's lemma, which yields much
tighter constants. For instance, they improve the lower bounds for Vapnik-Cervonenkis
classes [53, Chapter 14] by a factor greater than 1000, and lead to the
following simple bound.

Theorem 9 Let $\mathcal{F}$ be a set of binary classification functions of VC-dimension $V$.
For any classification rule $\hat f$ trained on a data set of size $n \ge \frac{V}{4}$, there exists a
probability distribution generating the data for which

$$\mathbb{E}R(\hat f) - \inf_{f\in\mathcal{F}} R(f) \ge \frac{1}{8}\sqrt{\frac{V}{n}}. \qquad (3.2.3)$$

3.2.3. Limitation of progressive indirect mixture rules. Let $\hat g_\lambda$ denote
a progressive indirect mixture rule (it could be a progressive mixture or not)
for some $\lambda > 0$. Under boundedness assumptions (and even under some exponential
moment assumptions) and appropriate choice of $\lambda$, $\hat g_\lambda$ satisfies an expected
excess risk bound of order $(\log d)/n$. Then one would also expect the excess
risk $R(\hat g_\lambda) - R(g^*_{\mathrm{MS}})$ to be of order $(\log d)/n$ with high probability. In fact, this does not
necessarily happen as the following theorem holds for $d = 2$.

Theorem 10 Let $g_1$ and $g_2$ be the constant functions respectively equal to 1 and
$-1$. For any $\lambda > 0$ and any training set size $n$ large enough, there exist $\epsilon > 0$ and
a distribution generating the data for which $Y \in [-1,1]$ almost surely, and with
probability larger than $\epsilon$, we have

$$R(\hat g_\lambda) - R(g^*_{\mathrm{MS}}) \ge c\sqrt{\frac{\log(e\epsilon^{-1})}{n}},$$

where $c$ is a positive constant only depending on $\lambda$.

More precisely, in [8], it is shown that for large enough $n$, and some constants
$c_1 > 1/2$ and $c_2 > 0$ only depending on $\lambda$, with probability at least $1/n^{c_1}$, we
have $R(\hat g_\lambda) - R(g^*_{\mathrm{MS}}) \ge c_2\sqrt{(\log n)/n}$. Since $c_1 > 1/2$, there is naturally no
contradiction with the fact that, in expectation, the excess risk is of order $1/n$.

3.2.4. Getting round the previous limitation. I now present the algorithm
introduced in [8], and called the empirical star estimator, which has both
expectation and deviation convergence rate of order $(\log d)/n$. The empirical risk of a
prediction function $g : \mathcal{X} \to \mathbb{R}$ is defined by

$$r(g) = \frac{1}{n}\sum_{i=1}^{n} [Y_i - g(X_i)]^2.$$

Let $\hat g^{(\mathrm{erm})}$ be an empirical risk minimizer among the reference functions:

$$\hat g^{(\mathrm{erm})} \in \operatorname*{argmin}_{g\in\{g_1,\dots,g_d\}} r(g).$$

For any prediction functions $g$, $g'$, let $[g, g']$ denote the set of functions which are
convex combinations of $g$ and $g'$: $[g,g'] = \{\alpha g + (1 - \alpha)g' : \alpha \in [0, 1]\}$. The
empirical star estimator $\hat g^{(\mathrm{star})}$ minimizes the empirical risk over a star-shaped set
of functions, precisely:

$$\hat g^{(\mathrm{star})} \in \operatorname*{argmin}_{g\in[\hat g^{(\mathrm{erm})},g_1]\cup\cdots\cup[\hat g^{(\mathrm{erm})},g_d]} r(g).$$
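The two-step construction above is easy to compute for the quadratic loss, because the empirical risk restricted to a segment $[\hat g^{(\mathrm{erm})}, g_j]$ is a one-dimensional quadratic in $\alpha$ with a closed-form minimizer. The following Python sketch illustrates this under assumptions of its own: the reference functions are represented by their predictions on the sample (`pred_matrix[j][i]` stands for $g_j(X_i)$), and the names `empirical_risk` and `empirical_star` are hypothetical.

```python
def empirical_risk(preds, ys):
    """Mean squared error of a vector of predictions."""
    return sum((y - p) ** 2 for p, y in zip(preds, ys)) / len(ys)


def empirical_star(pred_matrix, ys):
    """Empirical star estimator for the quadratic loss.

    Returns (j_erm, j, alpha, risk): the estimator is
    alpha * g_{j_erm} + (1 - alpha) * g_j, with empirical risk `risk`.
    """
    risks = [empirical_risk(p, ys) for p in pred_matrix]
    j_erm = min(range(len(risks)), key=risks.__getitem__)
    g0 = pred_matrix[j_erm]
    best = (j_erm, j_erm, 1.0, risks[j_erm])
    for j, gj in enumerate(pred_matrix):
        # On the segment, residuals are u_i - alpha * v_i.
        u = [y - b for y, b in zip(ys, gj)]
        v = [a - b for a, b in zip(g0, gj)]
        vv = sum(t * t for t in v)
        if vv == 0:
            alpha = 1.0  # the segment is reduced to a point on the sample
        else:
            alpha = sum(s * t for s, t in zip(u, v)) / vv
            alpha = min(1.0, max(0.0, alpha))  # stay inside the segment
        mix = [alpha * p0 + (1 - alpha) * pj for p0, pj in zip(g0, gj)]
        risk = empirical_risk(mix, ys)
        if risk < best[3]:
            best = (j_erm, j, alpha, risk)
    return best
```

Clipping the unconstrained minimizer to $[0, 1]$ is valid because the risk is convex in $\alpha$, so its minimum over the segment is attained at the projection of the unconstrained minimizer.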

The main result concerning this estimator is the following. 

Theorem 11 Assume that $|Y| \le B$ almost surely and $\|g_j\|_\infty \le B$ for any $j \in
\{1, \dots, d\}$. Then the empirical star algorithm satisfies: for any $\epsilon > 0$, with
probability at least $1 - \epsilon$,

$$R(\hat g^{(\mathrm{star})}) - R(g^*_{\mathrm{MS}}) \le \frac{200 B^2 \log[3d(d-1)\epsilon^{-1}]}{n} \le \frac{600 B^2 \log(d\epsilon^{-1})}{n}.$$

Consequently, we also have

$$\mathbb{E}R(\hat g^{(\mathrm{star})}) - R(g^*_{\mathrm{MS}}) \le \frac{400 B^2 \log(3d)}{n}.$$

An additional advantage of this empirical star estimator is that it does not
need to know the constant $B$. In other words, it is adaptive to the smallest value
of $B$ for which the boundedness assumptions hold. This was not the case of the
progressive mixture rules in which we need to take $\lambda \le 1/(2B^2)$ for the indirect
ones and $\lambda \le 1/(8B^2)$ for the "direct" one in order to state Inequality (3.2.2).
On the negative side, the theoretical guarantee on the expected excess risk is 200
times larger than the one stated for the best PIM. However, this is more an artefact
of the intricate proof of Theorem 11 than a drawback of the algorithm.

Another difference with progressive mixture rules is that the function output
by the estimator is inside $\bigcup_{1\le j<k\le d}[g_j, g_k]$, which is not in general the case
for the progressive (indirect) mixture rules. We have already seen in Section 3.2.1
that the empirical risk minimizer on $\{g_1, \dots, g_d\}$ does not have the minimax optimal
rate. A natural question in view of the empirical star algorithm is whether empirical
risk minimization on $\bigcup_{1\le j<k\le d}[g_j, g_k]$ would reach the $(\log d)/n$ rate. It can
be proved for $d = 3$ that, even under boundedness assumptions, the rate cannot
be better than $n^{-2/3}$ for an adequate choice of the functions and the distribution
(proof omitted by lack of interest in negative results).

Interestingly, Lecué and Mendelson [91] proposed a variant of the empirical
star algorithm, which also uses the empirical risk minimizer $\hat g^{(\mathrm{erm})}$ to define a set
of functions on which the empirical risk is minimized. Precisely, for a confidence
level $\epsilon > 0$, let $\hat{\mathcal{G}}$ be the set of functions $g \in \{g_1, \dots, g_d\}$ satisfying

$$r(g) \le r(\hat g^{(\mathrm{erm})}) + C B \sqrt{\frac{\log(2d\epsilon^{-1})}{n}} \sqrt{\frac{1}{n}\sum_{i=1}^{n} [g(X_i) - \hat g^{(\mathrm{erm})}(X_i)]^2} + C\,\frac{B^2\log(2d\epsilon^{-1})}{n},$$

where $C$ is a positive constant. The final estimator is the empirical risk minimizer
in the convex hull of $\hat{\mathcal{G}}$. It is also shown there that the selection of a subset of functions
$\hat{\mathcal{G}}$ before taking the convex hull is necessary to achieve the minimax convergence
rate since the empirical risk minimizer on the convex hull of $\{g_1, \dots, g_d\}$
has an excess risk at least of order $1/\sqrt{n}$ for an appropriate distribution and $d$ of
order $\sqrt{n}$.

The advantage of the empirical star algorithm over the empirical risk minimizer
on the convex hull of $\hat{\mathcal{G}}$ is its adaptivity to both the confidence level and
the constant $B$, and a theoretical guarantee of the form $C\frac{\log(d\epsilon^{-1})}{n}$ instead of
$C\frac{\log(d)\log(\epsilon^{-1})}{n}$ for the empirical risk minimizer on the convex hull of $\hat{\mathcal{G}}$.

3.3. Convex aggregation



When $d \le \sqrt{n}$, the minimax learning rates for problems (C) and (L) are both of
order $\frac{d}{n}$, meaning that estimators solving problem (L) are solutions to problem (C)
for $d \le \sqrt{n}$. So, estimators for $d \le \sqrt{n}$ are given in the section devoted to linear
aggregation (Section 3.4), and this section focuses on the case when $d \ge n^{1/2+\delta}$.

The literature contains few results for problem (C) with constant $c = 1$ in
(3.1.2) and minimax optimal residual term for $d \ge n^{1/2+\delta}$, with $\delta > 0$. The first
type of results is to apply the progressive mixture rule on an appropriate grid of
the simplex [123]. Another solution is to use the exponentiated gradient algorithm
introduced and studied by Kivinen and Warmuth [74] in the context of sequential
prediction for the quadratic loss, and then extended to general loss functions by
Cesa-Bianchi [43]. Lemma 7 has to be invoked to convert these algorithms and
the bounds to our statistical framework. Juditsky, Nazin, Tsybakov and Vayatis 
iTTSlI has viewed the resulting algorithm as a stochastic version of the mirror de- 
scent algorithm, and proposed a different choice of the temperature parameter, 
while still reaching the optimal convergence rate. All the above results hold in 
expectation, and it is not clear that the deviations of the excess risk bounds are 
sub-exponential. The estimator presented hereafter does not share this drawback. 

To address problem (C) (defined in Section 3.1), the first chapter of my PhD
thesis establishes empirical excess risk bounds for any estimator that produces a
prediction function in the convex hull of $g_1, \dots, g_d$ whatever the empirical data
are. Any such estimator $\hat g$ can be associated with a function $\hat\rho$ mapping a training
set to a distribution on $\{g_1, \dots, g_d\}$ such that $\hat g(Z_1^n) = \mathbb{E}_{g\sim\hat\rho(Z_1^n)}\, g$. Conversely, any
mapping $\hat\rho$ from $\mathcal{Z}^n$ (the set of training sets of size $n$) to the set $\mathcal{M}$ of distributions
on $\{g_1, \dots, g_d\}$ defines an estimator

$$\hat g = \mathbb{E}_{g\sim\hat\rho}\, g,$$

where we have dropped the training set for the sake of compactness. Similarly,
there exists a distribution $\rho^*$ on $\{g_1, \dots, g_d\}$ such that

$$g^*_{\mathrm{C}} = \mathbb{E}_{g\sim\rho^*}\, g.$$

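The assumptions are boundedness of the functions $g_1, \dots, g_d$ and of the regression
function $g^{(\mathrm{reg})} : x \mapsto \mathbb{E}(Y \mid X = x)$ and uniform boundedness of the conditional
exponential moments of the output knowing the input. Precisely, there
exist $B > 0$, $\alpha > 0$, and $M > 0$ such that for any $g', g''$ in $\{g^{(\mathrm{reg})}, g_1, \dots, g_d\}$,
$\|g' - g''\|_\infty \le B$ and for any $x \in \mathcal{X}$,

$$\mathbb{E}\big\{\exp\big(\alpha|Y - g^{(\mathrm{reg})}(X)|\big) \mid X = x\big\} \le M.$$

Theorem 12 Under the above assumptions, there exist $C_1, C_2 > 0$ depending
only on the constant $M$ and the product $\alpha B$ such that for any (prior) distribution
$\pi \in \mathcal{M}$, any $\epsilon > 0$, and any aggregating procedure $\hat\rho : \mathcal{Z}^n \to \mathcal{M}$, with probability
at least $1 - \epsilon$,

$$R(\mathbb{E}_{g\sim\hat\rho}\, g) - R(g^*_{\mathrm{C}}) \le \min_{\lambda\in[0,C_1]} \Big( (1 + \lambda)\big[r(\mathbb{E}_{g\sim\hat\rho}\, g) - r(g^*_{\mathrm{C}})\big] + \frac{2\lambda}{n}\sum_{i=1}^{n} \mathrm{Var}_{g\sim\hat\rho}\, g(X_i) + C_2 B^2\, \frac{K(\hat\rho,\pi) + \log(2\log(2n)\epsilon^{-1})}{n\lambda} \Big). \qquad (3.3.1)$$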

This bound comes from the PAC-Bayesian analysis, and consequently, the
complexity of an aggregating procedure is measured by the Kullback-Leibler divergence
of $\hat\rho$ with respect to some prior distribution $\pi$ on $\{g_1, \dots, g_d\}$. In the absence
of prior knowledge, $\pi$ is chosen as the uniform distribution, which allows one
to bound the KL divergence uniformly by $\log d$. Besides the usual empirical excess
risk, Inequality (3.3.1) depends on the empirical variance of $g(x)$ when $g$ is
drawn according to $\hat\rho$. Unlike the Kullback-Leibler term, this term is small for
concentrated posterior distributions.

All previous results of this chapter were easily generalizable to losses of quadratic
type under boundedness assumptions, that is losses with second derivative with respect
to their second argument uniformly lower and upper bounded by positive constants.
To my knowledge, the generalization cannot be done here as the analysis
strongly relies on the remarkable identity³

$$R(\mathbb{E}_{g\sim\rho}\, g) = \mathbb{E}_{(g',g'')\sim\rho\otimes\rho}\, \mathbb{E}[Y - g'(X)][Y - g''(X)], \qquad (3.3.2)$$

which is specific to the quadratic loss and allows one to apply the PAC-Bayesian analysis
for distributions on the product space $\{g_1, \dots, g_d\} \times \{g_1, \dots, g_d\}$.

Let $\hat\rho_{\mathrm{C}}$ be the distribution minimizing the right-hand side of (3.3.1) with $\pi$
the uniform distribution on $\{g_1, \dots, g_d\}$ and where $-(1 + \lambda)r(g^*_{\mathrm{C}})$ is replaced
by its upper bound $-r(g^*_{\mathrm{C}}) - \lambda \min_{g\in\{\sum_{j=1}^{d}\theta_j g_j;\ \theta_1\ge 0,\dots,\theta_d\ge 0,\ \sum_{j=1}^{d}\theta_j = 1\}} r(g)$. When
defining $\hat\rho_{\mathrm{C}}$, for the sake of computability of the estimator [7, Chap. 1, Theorem 4.2.2],
one can also replace the minimum over $[0, C_1]$ by a minimum over a geometric
grid of the interval $[n^{-1}, C_1]$ without altering the validity of the following theorem.

Theorem 13 For any $\epsilon > 0$, with probability at least $1 - \epsilon$, we have

$$R(\mathbb{E}_{g\sim\hat\rho_{\mathrm{C}}}\, g) - R(g^*_{\mathrm{C}}) \le C B \sqrt{\frac{\log(d\log(2n)\epsilon^{-1})\ \mathbb{E}\,\mathrm{Var}_{g\sim\rho^*}\, g(X)}{n}} + C B^2\, \frac{\log(d\log(2n)\epsilon^{-1})}{n}$$

for some constant $C > 0$ depending only on $\alpha B$ and $M$.

Since $\mathbb{E}\,\mathrm{Var}_{g\sim\rho^*}\, g(X) \le B^2/4$, the excess risk is at most of order $\sqrt{\log(d\log(2n))/n}$,
that is the minimax convergence rate of the convex aggregation task for $d \ge n^{1/2+\delta}$,
with $\delta > 0$. Besides, when the best convex combination happens to be a vertex of
the simplex defined by $\{g_1, \dots, g_d\}$, the variance term equals zero, and thus, the
convergence rate is $\log(d\log(2n))/n$, that is the minimax convergence rate of model
selection type aggregation (at least for $d \ge \log(2n)$).

³To be precise, [7, Chap. 1] used $R(\mathbb{E}_{g\sim\rho}\, g) = \mathbb{E}_{g\sim\rho} R(g) - \frac{1}{2}\mathbb{E}_{g'\sim\rho}\mathbb{E}_{g''\sim\rho}\mathbb{E}[g'(X) - g''(X)]^2$,
but it would have been more direct to use (3.3.2).



3.4. Linear aggregation 



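To handle problems (C) and (L) in the same framework and also to incorporate
other possible constraints on the coefficients of the linear combination, let us
consider $\Theta$ a closed convex subset of $\mathbb{R}^d$, and define

$$\mathcal{G} = \Big\{ \sum_{j=1}^{d} \theta_j g_j;\ (\theta_1, \dots, \theta_d) \in \Theta \Big\}.$$

Introduce the vector-valued function $\vec g : x \mapsto (g_1(x), \dots, g_d(x))^T$. The function
$\sum_{j=1}^{d} \theta_j g_j$ can then be simply written $\langle\theta, \vec g\rangle$ with $\theta = (\theta_1, \dots, \theta_d)^T$. Let

$$g^* \in \operatorname*{argmin}_{g\in\mathcal{G}} R(g).$$

Thus, when $\Theta$ is the simplex of $\mathbb{R}^d$, we have $g^* = g^*_{\mathrm{C}}$ and when $\Theta = \mathbb{R}^d$, we have
$g^* = g^*_{\mathrm{L}}$.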

Aggregating linearly functions to design a prediction function with low quadratic 
risk is just the problem of linear least squares regression. It is a central task in 
statistics, since both linear parametric models and nonparametric estimation with 
linear approximation spaces (piecewise polynomials based on a regular partition, 
wavelet expansions, trigonometric polynomials, . . . ) are popular. It has thus been 
widely studied. 

Classical statistical textbooks often only state results for the fixed design setting
as a bound of order $d/n$ can be rather easily obtained in this case. This can be
misleading since it does not imply a $d/n$ upper bound in the random design setting.
For the truncated ordinary least squares estimator, Györfi, Kohler, Krzyżak
and Walk [63, Theorem 11.3] give a bound of the form of (3.1.2) with
residual term of order $\frac{d\log n}{n}$ and $c = 8$. When the input distribution is known, Tsybakov
[123] provides a bound of order $d/n$ on the expected risk of a projection estimator
on an orthonormal basis of $\mathcal{G}$ for the dot product $(f, g) \mapsto \mathbb{E}[f(X)g(X)]$.

Catoni [37, Proposition 5.9.1] and Alquier [5] have used the PAC-Bayesian
approach to prove high probability excess risk bounds of order $d/n$ involving
the conditioning of the Gram matrix $Q = \mathbb{E}[\vec g(X)\vec g(X)^T]$. Both results require at
least exponential moments on the conditional distribution of the output $Y$ knowing
the input vector $\vec g(X)$.

From the work of Birgé and Massart [31], one can derive an excess risk
bound for the empirical risk minimizer of order at worst $\frac{d\log n}{n}$ and asymptotically
of order $d/n$. It holds with high probability, for a bounded set $\Theta$ and requires
bounded input vectors and conditional exponential moments of the output.
Localized Rademacher complexities [80, 26] also allow one to study the empirical
risk minimizer on a bounded set of functions. They lead to a high probability
$d/n$ convergence rate of the empirical risk minimizer under strong assumptions:
uniform boundedness of the input vector, the output and the parameter set $\Theta$.

Penalized least squares estimators using the $L^2$-norm of the vector of coefficients,
or more recently, its $L^1$-norm have also been widely studied. A common
characteristic of the excess risk bounds obtained for these estimators is that they are
of order $d/n$ only under strong assumptions on the eigenvalues (of submatrices)
of $Q$.

In [14], Olivier Catoni and I provide new risk bounds for the ridge estimator
and the ordinary least squares estimator (Section 3.4.1). We also propose a
min-max estimator which has non-asymptotic guarantee of order $d/n$ under weak
assumptions on the distributions of the output $Y$ and the random variables $g_j(X)$,
$j = 1, \dots, d$ (Section 3.4.2). Finally, we propose a sophisticated PAC-Bayesian
estimator which satisfies a simpler $d/n$ bound (Section 3.4.3).

The key common surprising factor of these results is the absence of expo- 
nential moment condition on the output distribution while achieving exponential 
deviations. All risk bounds are obtained through a PAC-Bayesian analysis on trun- 
cated differences of losses. Our results tend to say that truncation leads to more 
robust algorithms. Local robustness to contamination is usually invoked to advo- 
cate the removal of outliers, claiming that estimators should be made insensitive 
to small amounts of spurious data. Our work leads to a different theoretical expla- 
nation. The observed points having unusually large outputs when compared with 
the (empirical) variance should be down-weighted in the estimation of the mean, 
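since they contain less information than noise. In short, huge outputs should be
truncated because of their low signal to noise ratio.

3.4.1. Ridge regression and empirical risk minimization. The ridge
regression estimator on $\mathcal{G}$ is defined by $\hat g^{(\mathrm{ridge})} = \langle\hat\theta^{(\mathrm{ridge})}, \vec g\rangle$ with

$$\hat\theta^{(\mathrm{ridge})} \in \operatorname*{argmin}_{\theta\in\Theta}\ r(\langle\theta, \vec g\rangle) + \lambda\|\theta\|^2,$$

where $\lambda$ is some nonnegative real parameter and $r(\langle\theta, \vec g\rangle)$ is the empirical risk of
the function $\langle\theta, \vec g\rangle$. In the case when $\lambda = 0$, the ridge regression $\hat g^{(\mathrm{ridge})}$ is nothing
but the empirical risk minimizer $\hat g^{(\mathrm{erm})}$.

In the same way we consider the optimal ridge function optimizing the expected
ridge risk: $\tilde g = \langle\tilde\theta, \vec g\rangle$ with

$$\tilde\theta \in \operatorname*{argmin}_{\theta\in\Theta} \big\{ R(\langle\theta, \vec g\rangle) + \lambda\|\theta\|^2 \big\}.$$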

Our first result is of asymptotic nature. It is stated under weak hypotheses, 
taking advantage of the weak law of large numbers. 

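Theorem 14 Let us assume that

$$\mathbb{E}\big[\|\vec g(X)\|^4\big] < +\infty, \qquad (3.4.1)$$

and

$$\mathbb{E}\big\{\|\vec g(X)\|^2 [\tilde g(X) - Y]^2\big\} < +\infty. \qquad (3.4.2)$$

Let $\nu_1, \dots, \nu_d$ be the eigenvalues of the Gram matrix $Q = \mathbb{E}[\vec g(X)\vec g(X)^T]$,
and let $Q_\lambda = Q + \lambda I$ be the ridge regularization of $Q$. Let us define the effective
ridge dimension

$$D = \sum_{i=1}^{d} \frac{\nu_i}{\nu_i + \lambda}\, 1_{\nu_i > 0} = \mathrm{Tr}\big[(Q + \lambda I)^{-1} Q\big] = \mathbb{E}\big[\|Q_\lambda^{-1/2}\vec g(X)\|^2\big].$$

When $\lambda = 0$, $D$ is equal to the rank of $Q$ and is otherwise smaller. For any $\epsilon > 0$,
there is $n_\epsilon$, such that for any $n \ge n_\epsilon$, with probability at least $1 - \epsilon$,

$$R(\hat g^{(\mathrm{ridge})}) + \lambda\|\hat\theta^{(\mathrm{ridge})}\|^2 \le \min_{\theta\in\Theta} \big\{ R(\langle\theta, \vec g\rangle) + \lambda\|\theta\|^2 \big\} + C\ \operatorname{ess\,sup}\ \mathbb{E}\big\{[Y - \tilde g(X)]^2 \mid X\big\}\ \frac{D + \log(3\epsilon^{-1})}{n},$$

for some numerical constant $C > 0$.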
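To make the effective ridge dimension concrete, here is a small Python sketch computing $D = \sum_i \frac{\nu_i}{\nu_i + \lambda} 1_{\nu_i > 0}$ from a list of eigenvalues of the Gram matrix. It is an illustration only: the eigenvalues are assumed to be given (computing them from data is outside the scope of the sketch), and the function name `effective_dimension` is hypothetical.

```python
def effective_dimension(eigenvalues, lam):
    """Effective ridge dimension: sum over positive eigenvalues of nu / (nu + lam).

    For lam = 0 this counts the positive eigenvalues, i.e. the rank of the
    Gram matrix; for lam > 0 each direction contributes strictly less than 1.
    """
    return sum(nu / (nu + lam) for nu in eigenvalues if nu > 0)


if __name__ == "__main__":
    spectrum = [2.0, 1.0, 0.01, 0.0]
    for lam in (0.0, 0.1, 1.0):
        print(lam, effective_dimension(spectrum, lam))
```

Directions whose eigenvalue is much smaller than $\lambda$ contribute almost nothing to $D$, while those with eigenvalue much larger than $\lambda$ contribute almost 1, which is the threshold effect of the ridge penalty on the spectrum.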

This theorem shows that the ordinary least squares estimator (obtained when
$\Theta = \mathbb{R}^d$ and $\lambda = 0$), as well as the empirical risk minimizer on any closed
convex set, asymptotically reach a $d/n$ speed of convergence under very weak
hypotheses. It also shows the regularization effect of the ridge regression. There
emerges an effective dimension $D$, where the ridge penalty has a threshold effect
on the eigenvalues of the Gram matrix.
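
The threshold effect of the penalty on the spectrum can be made concrete with a small numerical sketch. It is only an illustration: the function name and the example spectrum are hypothetical, and the eigenvalues of the Gram matrix are assumed to be already available.

```python
def effective_ridge_dimension(eigenvalues, lam):
    """D = sum over the positive eigenvalues nu_i of nu_i / (nu_i + lam)."""
    return sum(nu / (nu + lam) for nu in eigenvalues if nu > 0)


# Hypothetical spectrum of a Gram matrix of rank 2 in dimension 3.
spectrum = [4.0, 1.0, 0.0]

# Without penalty, D is the rank of the matrix.
d_no_penalty = effective_ridge_dimension(spectrum, 0.0)

# With lam = 1, eigenvalues well above lam still count almost fully,
# while those comparable to lam are discounted: 4/5 + 1/2.
d_with_penalty = effective_ridge_dimension(spectrum, 1.0)
```

Eigenvalues much larger than the penalty contribute close to 1 and those much smaller contribute close to 0, which is the sense in which the penalty acts as a threshold.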

On the other hand, the weakness of this result is its asymptotic nature: $n_\varepsilon$
may be arbitrarily large under such weak hypotheses, and this shows even in the
simplest case of the estimation of the mean of a real-valued random variable by
its empirical mean, which is the case when $d = 1$ and $g(X) \equiv 1$ [42]. Typically,
the proof of Theorem 14 shows that $n_\varepsilon$ is of order $1/\varepsilon$. To avoid this limitation,
we were led to consider more involved algorithms as described in the following
two sections.

3.4.2. A MIN-MAX ESTIMATOR FOR ROBUST ESTIMATION. This section provides
an alternative to the empirical risk minimizer with non asymptotic exponential
risk deviations of order $d/n$ for any confidence level. Moreover, we will
assume only a second order moment condition on the output and cover the case
of unbounded inputs, the requirement on the random variables $g_j(X)$ being only
a finite fourth order moment. On the other hand, we assume that the set $\Theta$ of the
vectors of coefficients is bounded. (This still allows one to solve problem (L) as soon
as we know a bounded set in which the optimal coefficient vector lies for sure.)
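Let $\alpha > 0$ and consider the truncation function:

\[ T(x) = \begin{cases} -\log(1 - x + x^2/2) & 0 \le x \le 1, \\ \log(2) & x \ge 1, \\ -T(-x) & x \le 0. \end{cases} \]

For any $g, g' \in \mathcal{G}$, introduce

\[ \mathcal{D}(g, g') = \sum_{i=1}^{n} T\big(\alpha [Y_i - g(X_i)]^2 - \alpha [Y_i - g'(X_i)]^2\big). \]

Let us assume in this section that for any $j \in \{1, \dots, d\}$,

\[ \mathbb{E}\big\{g_j(X)^2 [Y - g^*(X)]^2\big\} < +\infty, \qquad (3.4.3) \]

and

\[ \mathbb{E}\big[g_j(X)^4\big] < +\infty. \qquad (3.4.4) \]

Define

\[ \mathcal{S} = \big\{g \in \mathrm{span}\{g_1, \dots, g_d\} : \mathbb{E}[g(X)^2] = 1\big\}, \qquad (3.4.5) \]

\[ \sigma = \sqrt{\mathbb{E}\{[Y - g^*(X)]^2\}} = \sqrt{R(g^*)}, \qquad (3.4.6) \]

\[ \chi = \max_{g \in \mathcal{S}} \sqrt{\mathbb{E}[g(X)^4]}, \qquad (3.4.7) \]

\[ \kappa = \frac{\sqrt{\mathbb{E}\{[g(X)^T Q^{-1} g(X)]^2\}}}{\mathbb{E}[g(X)^T Q^{-1} g(X)]}, \qquad (3.4.8) \]

\[ \kappa' = \frac{\sqrt{\mathbb{E}\{[Y - g^*(X)]^4\}}}{\mathbb{E}\{[Y - g^*(X)]^2\}} = \frac{\sqrt{\mathbb{E}\{[Y - g^*(X)]^4\}}}{\sigma^2}, \qquad (3.4.9) \]

\[ \mathcal{R} = \max_{g', g'' \in \mathcal{G}} \sqrt{\mathbb{E}\{[g'(X) - g''(X)]^2\}}. \qquad (3.4.10) \]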

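Theorem 15 Let us assume that (3.4.3) and (3.4.4) hold. For some numerical
constants $c$ and $c'$, for

\[ n > c \kappa \chi d, \]

by taking

\[ \alpha = \frac{1}{2\chi \big[2\sqrt{\kappa'}\,\sigma + \sqrt{\chi}\,\mathcal{R}\big]^2} \Big(1 - \frac{c \kappa \chi d}{n}\Big), \]

for any estimator $\hat g$ satisfying $\hat g \in \mathcal{G}$ a.s., for any $\varepsilon > 0$, with probability at least
$1 - \varepsilon$, we have

\[ R(\hat g) - R(g^*) \le \frac{1}{n\alpha} \Big( \max_{g' \in \mathcal{G}} \mathcal{D}(\hat g, g') - \inf_{g \in \mathcal{G}} \max_{g' \in \mathcal{G}} \mathcal{D}(g, g') \Big) + \frac{c \kappa \kappa' d \sigma^2}{n} + 8\chi \Big( \frac{\log(\varepsilon^{-1})}{n} + \frac{c' \kappa^2 d^2}{n^2} \Big) \frac{\big[2\sqrt{\kappa'}\,\sigma + \sqrt{\chi}\,\mathcal{R}\big]^2}{1 - \frac{c \kappa \chi d}{n}}. \qquad (3.4.11) \]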

The above theorem suggests looking for a function realizing the min-max of
$(g, g') \mapsto \mathcal{D}(g, g')$. More precisely, an estimator such that

\[ \max_{g' \in \mathcal{G}} \mathcal{D}(\hat g, g') < \inf_{g \in \mathcal{G}} \max_{g' \in \mathcal{G}} \mathcal{D}(g, g') + \sigma^2 \frac{d}{n}, \]

has a non asymptotic bound for the excess risk with a $d/n$ convergence rate and
an exponential tail even when neither the output $Y$ nor the input vector $g(X)$
has exponential moments. This stronger non asymptotic bound compared to the
bounds of the previous section comes at the price of replacing the empirical risk
minimizer by a more involved estimator. Nevertheless, reasonable heuristics can
be developed to compute it approximately [14, Section 3], and lead to a significantly
better estimator of $g^*_{\mathrm{lin}}$ than the ordinary least squares estimator when there
is some heavy-tailed noise (see Appendix G).
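
To see why the truncation tames huge outputs, here is a minimal sketch of a truncation function of the kind used in this section (equal to $-\log(1 - x + x^2/2)$ on $[0, 1]$, constant equal to $\log 2$ beyond 1, and extended to negative arguments by oddness) together with the sum of truncated differences of losses. The function names and the toy data are hypothetical, and the sketch does not implement the min-max estimator itself.

```python
import math


def truncation(x):
    """Odd truncation: -log(1 - x + x^2/2) on [0, 1], log(2) for x >= 1."""
    if x < 0:
        return -truncation(-x)
    if x >= 1:
        return math.log(2)
    return -math.log(1 - x + x * x / 2)


def truncated_loss_difference(alpha, ys, preds_g, preds_g_prime):
    """Sum over the sample of the truncated, rescaled differences of
    squared losses between two candidate prediction vectors."""
    return sum(
        truncation(alpha * (y - a) ** 2 - alpha * (y - b) ** 2)
        for y, a, b in zip(ys, preds_g, preds_g_prime)
    )


# A single huge output cannot contribute more than log(2) in absolute value.
outlier_term = truncated_loss_difference(1.0, [1000.0], [0.0], [1.0])
```

Each observation contributes at most $\log 2$ in absolute value, whatever the size of the output, whereas the untruncated difference of squared losses in the example is 1999.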

3.4.3. A SIMPLE TIGHT RISK BOUND FOR A SOPHISTICATED PAC-BAYES ALGORITHM.
A disadvantage of the min-max estimator proposed in the previous
section is that its theoretical guarantee depends (implicitly) on kurtosis-like
coefficients. We provide in [14, Section 4] a more sophisticated estimator, having the
following simple excess risk bound independent of these kurtosis-like quantities,
and still of order $d/n$. It holds under a stronger assumption on the input vector $g(X)$
(precisely, uniform boundedness), still assumes that the set $\Theta$ is bounded, and
holds under a second order moment condition on the output.

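Theorem 16 Assume that $\mathcal{G}$ has a diameter $H$ for $L^\infty$-norm:

\[ \sup_{g', g'' \in \mathcal{G},\, x \in \mathcal{X}} |g'(x) - g''(x)| = H \qquad (3.4.12) \]

and that, for some $\sigma > 0$,

\[ \sup_{x \in \mathcal{X}} \mathbb{E}\big\{[Y - g^*(X)]^2 \,\big|\, X = x\big\} \le \sigma^2 < +\infty. \]

There exists an estimator $\hat g$ such that for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$,
we have

\[ R(\hat g) - R(g^*) \le 17 (2\sigma + H)^2 \, \frac{d + \log(2\varepsilon^{-1})}{n}. \]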

On the negative side, when the target is to solve problem (L), it requires
the knowledge of an $L^\infty$-bounded ball in which $g^*_{\mathrm{lin}}$ lies and an upper bound on
$\sup_{x \in \mathcal{X}} \mathbb{E}\{[Y - g^*_{\mathrm{lin}}(X)]^2 \,|\, X = x\}$. The looser this knowledge is, the bigger the
constant in front of $d/n$ is. On the positive side, the convergence rate is of order
$d/n$, with neither an extra logarithmic factor nor constant factors involving the
conditioning of the Gram matrix $Q$ or some kurtosis-like coefficients.

To conclude this section, let us add that, when the output admits uniformly 
bounded conditional exponential moments, a relatively simple Gibbs estimator 
also achieves the d/n convergence rate. Precisely we have the following theorem. 

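Theorem 17 Assume that (3.4.12) holds for $H < +\infty$, and that there exist
$\alpha > 0$ and $M > 0$ such that for any $x \in \mathcal{X}$,

\[ \mathbb{E}\big\{\exp\big[\alpha |Y - g^*(X)|\big] \,\big|\, X = x\big\} \le M. \]

Consider the probability distribution $\hat\pi$ on $\mathcal{G}$ defined by its density with respect to
the uniform distribution $\pi$ on $\mathcal{G}$:

\[ \frac{d\hat\pi}{d\pi}(g) = \frac{\exp\big(-\lambda \sum_{i=1}^{n} [Y_i - g(X_i)]^2\big)}{\mathbb{E}_{g' \sim \pi} \exp\big(-\lambda \sum_{i=1}^{n} [Y_i - g'(X_i)]^2\big)}, \]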

where $\lambda > 0$ is appropriately chosen (depending on $\alpha$, $H$ and $M$). For any $\varepsilon > 0$,
with probability at least $1 - \varepsilon$, we have

where the quantity $C > 0$ only depends on $\alpha$, $H$ and $M$.

3.5. High-dimensional input and sparsity



From the minimax rates of the three aggregation problems, we see that for
$n \ll d \ll e^n$, one can predict as well as the best convex combination up to a
small additive term, which is at most of order $\sqrt{(\log d)/n}$, but one cannot expect to
predict in general as well as the best linear combination up to a small additive
term. In this setting, one may want to reduce one's target by trying to predict as well
as (still up to a small additive term) the best linear combination of at most $s \ll d$
functions, that is the function

\[ g^* \in \operatorname*{argmin}_{g \in \{\langle \theta, g \rangle \,:\, \theta \in \mathbb{R}^d,\ \|\theta\|_0 \le s\}} R(g), \qquad (3.5.1) \]

where $\|\theta\|_0$ denotes the number of nonzero coefficients of $\theta$.

It is well-established that $L_1$ regularization allows one to perform this task. The
procedure is known as Lasso [122] [IH and is defined by $\hat g^{(\mathrm{lasso})} = \langle \hat\theta^{(\mathrm{lasso})}, g \rangle$ with

\[ \hat\theta^{(\mathrm{lasso})} \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big(Y_i - \langle \theta, g(X_i) \rangle\big)^2 + \lambda \|\theta\|_1, \]

where $\lambda > 0$ is a parameter to be tuned to retrieve the desired number of relevant
variables/functions^. As the $L_2$ penalty used in ridge regression, the $L_1$ penalty
shrinks the coefficients. The difference is that for coefficients which tend to be
close to zero, the shrinkage makes them equal to zero. This allows one to select
relevant variables/functions (i.e., find the $j$'s such that $\theta^*_j \ne 0$).
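
The difference between the two shrinkage mechanisms is easiest to see in the special case, assumed here purely for illustration, where the empirical Gram matrix is the identity. The penalized least squares problems then decouple coordinate by coordinate, and each solution is an explicit function of the ordinary least squares coefficient $z$. The function names below are hypothetical.

```python
def soft_threshold(z, t):
    """Move z towards zero by t, and return exactly zero when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate(z, lam):
    """Minimizer of (theta - z)^2 + lam * |theta| (orthonormal design assumed)."""
    return soft_threshold(z, lam / 2)


def ridge_coordinate(z, lam):
    """Minimizer of (theta - z)^2 + lam * theta^2 (orthonormal design assumed)."""
    return z / (1 + lam)


# A small coefficient is set exactly to zero by the L1 penalty,
# while the L2 penalty only rescales it.
small_lasso = lasso_coordinate(0.3, 1.0)
small_ridge = ridge_coordinate(0.3, 1.0)
```

Under this assumption the L1 penalty produces exact zeros, which is what makes variable selection possible, whereas the L2 penalty never does for a nonzero $z$.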

If we assume that the regression function $g^{(\mathrm{reg})}$ is a linear combination of only
$s \ll d$ variables among $\{g_1, \dots, g_d\}$, the typical result is to prove that the expected
excess risk of the Lasso estimator for $\lambda$ of order $\sqrt{(\log d)/n}$ is of order
$(s \log d)/n$ [36, 124, 105, 93]. Since this quantity is much smaller than $d/n$, this
makes a huge improvement (provided that the sparsity assumption is true). This
kind of result usually requires strong conditions on the eigenvalues of submatrices
of $Q$, essentially assuming that the functions $g_j$ are near orthogonal. Here we
will argue that by combining the estimators solving (MS) and (L), one can achieve
the minimax optimal learning rate without requiring such conditions. The guarantees
presented here are also stronger than the ones associated with $L_0$-regularization
(penalization proportional to the number of nonzero coefficients) whatever criterion
(Mallows' $C_p$ [96], AIC or BIC [116]) is used to tune the penalty constant.
Recent advances on theoretical guarantees of $L_0$-regularization can be found in the
works of Bunea, Tsybakov and Wegkamp [36] and of Birgé and Massart [32] for
the fixed design setting and in the work of Raskutti, Wainwright and Yu [114]
for the random design setting considered here. These results for $L_0$-regularization
are not as good as the ones for the estimator described in this section since the
$(s \log d)/n$ excess risk bound holds only when the conditional expectation of the
output knowing the input is inside the model.

^The functions $g_1, \dots, g_d$ can be called the explanatory variables of the output. Note also
that we can consider without loss of generality that the input space is $\mathbb{R}^d$ and that the functions
$g_1, \dots, g_d$ are the coordinate functions.

Precisely, let us assume^ that for some $B > 0$, $\|g^*\|_\infty \le B$ and $|Y| \le B$.
Let $\mathcal{L}_1$ denote the first half of the training set $\{Z_1, \dots, Z_{n/2}\}$, and $\mathcal{L}_2$ denote
the second half of the training set $\{Z_{n/2+1}, \dots, Z_n\}$, where for simplicity we have
assumed that $n$ is even. For any $I \subset \{1, \dots, d\}$ of size $s$, let $\hat g_I$ be the sophisticated
estimator that satisfies Theorem 16, trained on $\mathcal{L}_1$ and associated with the set
$\mathcal{G}_I = \{\langle \theta, g \rangle : \|\langle \theta, g \rangle\|_\infty \le B,\ \theta_j = 0 \text{ for any } j \notin I\}$. (One can alternatively consider
the Gibbs estimator of Theorem 17.) Let $\hat g$ be the empirical star estimator (defined
in Section 3.2.4) trained on $\mathcal{L}_2$ and associated with the $\binom{d}{s}$ functions $\hat g_I$ (that are
non-random given $\mathcal{L}_1$). This two-stage estimator satisfies the following theorem.

Theorem 18 For any $\varepsilon > 0$, with probability at least $1 - \varepsilon$,

\[ R(\hat g) - R(g^*) \le C B^2 \, \frac{s \log(d/s) + \log(2\varepsilon^{-1})}{n}, \qquad (3.5.2) \]

for some numerical constant $C > 0$.

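Proof. From Theorem 11, since we have $\binom{d}{s} \le (ed/s)^s$, with probability at least
$1 - \varepsilon/2$, we have

\[ R(\hat g) - \min_{I} R(\hat g_I) \le 1200 B^2 \, \frac{2 s \log(ed/s) + \log(2\varepsilon^{-1})}{n}. \]

Let $I^*$ be a set of $s$ variables containing the set of at most $s$ variables involved in
$g^*$. From Theorem 16, with probability at least $1 - \varepsilon/2$, we have

\[ R(\hat g_{I^*}) - R(g^*) \le 1224 B^2 \, \frac{2 s + \log(4\varepsilon^{-1})}{n}. \]

By using a union bound, we obtain

\[ R(\hat g) - R(g^*) \le 1200 B^2 \, \frac{2 s \log(ed/s) + \log(2\varepsilon^{-1})}{n} + 1224 B^2 \, \frac{2 s + \log(4\varepsilon^{-1})}{n}, \]

which gives the desired result. □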
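
The counting step of the proof, namely that the number of candidate supports of size $s$ is at most $(ed/s)^s$, is what turns the logarithm of the number of aggregated functions into a term of order $s \log(d/s)$. A quick numerical check of this inequality (the function names and the tested values are arbitrary choices for illustration):

```python
import math


def log_number_of_supports(d, s):
    """Logarithm of the number of subsets of size s among d variables."""
    return math.log(math.comb(d, s))


def log_support_bound(d, s):
    """Logarithm of the upper bound (e * d / s) ** s used in the proof."""
    return s * math.log(math.e * d / s)


# Compare the exact count with the bound on a few (d, s) pairs.
checks = [
    (d, s, log_number_of_supports(d, s), log_support_bound(d, s))
    for d in (10, 100, 1000)
    for s in (1, 2, 5, 10)
]
```

The price of the approach is visible here as well: the second stage aggregates $\binom{d}{s}$ functions, a number that grows very quickly with $d$ and $s$.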



^We make boundedness assumptions for the sake of simplicity. The results can be generalized to
outputs having exponential conditional moments since both building blocks of the estimator can
handle this type of noisy outputs: for the empirical star algorithm, see the supplemental material
of m. Further generalizations are open problems.

Due to the particular structure of the empirical star algorithm, the estimator $\hat g$
can be written as a linear combination of at most $2s$ functions among $\{g_1, \dots, g_d\}$,
so that the estimator can be used for variable selection. The functions involved in
$g^*$ do not necessarily belong to this set of at most $2s$ functions. I do not believe that
achieving such identifiability of these particular relevant variables should be the
goal, since pursuing this target would definitely require that the different variables
are not too correlated, a situation which will rarely occur in practice.

Adaptivity with respect to the sparsity level of $g^*$ can also be obtained. Indeed,
let $s^*$ be the number of nonzero coefficients of the function $g^*$ defined by (3.5.1).
By using a three-stage estimator procedure using successively the empirical star
algorithm at a given level of sparsity and then on the $s$ functions thus designed,
it is easy to show that (3.5.2) still holds with $s$ replaced by $s^*$. Note that $g^*$ defined in
(3.5.1) depends on a sparsity level $s$, which can be taken equal to $d$. Then we
have $g^* = g^*_{\mathrm{lin}}$, and the three-stage procedure is adaptive to the sparsity level of
$g^*_{\mathrm{lin}}$. In the fixed design setting, Bunea, Tsybakov and Wegkamp [36] have shown
that these rates are minimax optimal, and it is natural to consider that their lower
bound extends to our random design case.

Another possible use of the algorithms solving problems (MS) and (L) is when
we consider sparsity with group structure. This occurs when the variables are
naturally organized into groups: in computer vision, this naturally occurs since
there exist different families of image descriptors, and the grouping can be done
by family, scale and/or position. Let $I_1, \dots, I_D \subset \{1, \dots, d\}$ be $D$ sets of grouped
variables. For a vector $\theta$, let us say that a group $I_k$ is active if there exists $j \in I_k$
such that $\theta_j \ne 0$. Let $S(\theta)$ be the number of active groups among $I_1, \dots, I_D$.

For a given sparsity level $s \in \{1, \dots, D\}$, the target is

\[ g^{*(\mathrm{group})} \in \operatorname*{argmin}_{g \in \{\langle \theta, g \rangle \,:\, \theta \in \mathbb{R}^d,\ S(\theta) \le s\}} R(g). \]

There exist only $\binom{D}{s}$ different sets of $s$ groups that could be active. So a two-stage
estimator $\hat g^{(\mathrm{group})}$ similar to the one described before satisfies that with probability
at least $1 - \varepsilon$,

\[ R(\hat g^{(\mathrm{group})}) - R(g^{*(\mathrm{group})}) \le C B^2 \, \frac{s \log(D/s) + J + \log(2\varepsilon^{-1})}{n}, \]

where $J$ denotes the number of nonzero coefficients in the linear combination
defining $g^{*(\mathrm{group})}$. This type of result has not been obtained yet for the group Lasso
[135] even when assuming low correlation between the variables, except for the
fixed design setting [69, 94].

We have presented in this section an example of theoretical results easily
obtainable from the estimators solving problems (MS) and (L). The results are
expressed in terms of sub-exponential excess risk bounds, which were not obtainable
before the introduction of the empirical star algorithm. An advantage of the
approach is its genericity: it is not restricted to particular families of estimators.

There are yet some limitations. First, there is no variable selection consistency
with this approach, but as stated before, this stronger type of result would
require strong assumptions on the input vector distribution that are often not met
in practice. In the fixed design setting, for overlapping groups, Jenatton, Bach and
I [70] have proved a high dimensional variable consistency result extending the
corresponding result for the Lasso [138, 128].

Second, the approach does not extend easily to the case of generalized additive 
models, in which linear combinations of a fixed number of functions are replaced 
by functional spaces [104], such as reproducing kernel Hilbert spaces in the cases
of multiple kernel learning MB M [IIIl IM M EUl • 

Finally, the most important limitation, which is often encountered when using
classical model selection approaches, is its computational intractability. So this
leaves open the following fundamental problem: is it possible to design a
computationally efficient algorithm with the above guarantees (i.e., without assuming
low correlation between the explanatory variables)?

Chapter 4
Multi-armed bandit problems 



4.1. Introduction 

Bandit problems illustrate the fundamental difficulty of decision making in the 
face of uncertainty: a decision maker must choose between following what seems 
to be the best choice in view of the past ("exploiting") or testing ("exploring") 
some alternative, hoping to discover a choice that beats the current best choice. 
More precisely, in the multi-armed bandit problem, at each stage, an agent (or 
decision maker) chooses one action (or arm), and receives a reward from it. The 
agent aims at maximizing his rewards. Since he does not know the process
generating the rewards, he needs to explore (try) the different actions and yet exploit
(concentrate his draws on) the seemingly most rewarding arms.

The multi-armed bandit problem is the simplest setting where one encounters
the exploration-exploitation dilemma. It has a wide range of applications
including advertisement lllTllSlll, economics [29, 85], games [59] and optimization
[111, 48, 76, 35]. It can be a central building block of larger systems, like in
evolutionary programming [68] and reinforcement learning [119], in particular in
large state space Markovian Decision Problems [79]. The name "bandit" comes
from imagining a gambler in a casino playing with $K$ slot machines, where at
each round, the gambler pulls the arm of any of the machines and gets a payoff as
a result. The seminal work of Robbins [115] casts the bandit problem in a stochastic
setting in which essentially the rewards obtained from an arm are independent
and identically distributed random variables that are also independent from the
rewards obtained from the other arms. Since the work of Auer, Cesa-Bianchi,
Freund and Schapire [19], it was also studied in an adversarial setting.

To set the notation, let $K \ge 2$ be the number of actions (or arms) and $n \ge K$
be the time horizon. A $K$-armed bandit problem is a game between an agent and
an environment in which, at each time step $t \in \{1, \dots, n\}$, (i) the agent chooses a
probability distribution $p_t$ on a finite set $\{1, \dots, K\}$, (ii) the environment chooses
a reward vector $g_t = (g_{1,t}, \dots, g_{K,t}) \in [0, 1]^K$ (possibly through some external
randomization), and simultaneously (independently), the agent draws the arm $I_t$
according to the distribution $p_t$, (iii) the agent only gets to see his own reward $g_{I_t,t}$.
The goal of the decision maker is to maximize his cumulative reward $\sum_{t=1}^{n} g_{I_t,t}$.

In the stochastic bandit problem, the environment cannot choose any reward
vectors: the reward vectors $g_t$ have to be independent and identically distributed,
and their components should be independent random variables^. So an environment
is just parameterized by a $K$-tuple of probability distributions $(\nu_1, \dots, \nu_K)$ on
$[0, 1]$. Note that the term "stochastic bandit" can be a bit misleading since the
assumption is not just stochasticity but rather an i.i.d. assumption.

^The independence of the components is always assumed in the literature, but is not fundamentally
useful (up to rare modifications of the numerical constants).

In the adversarial bandit problem, no such restriction is put so that past gains 
have no reason to be representative of future ones. This contrasts with the stochas- 
tic setting in which confidence bounds on the mean reward of the arms can be 
deduced from the rewards obtained so far. 

A policy is a strategy for choosing the drawing probability distribution $p_t$
based on the history formed by the past plays and the associated rewards. So
it is a function that maps any history to a probability distribution on $\{1, \dots, K\}$.
We define the regret of a policy with respect to the best constant decision as

\[ R_n = \max_{i=1,\dots,K} \sum_{t=1}^{n} g_{i,t} - \sum_{t=1}^{n} g_{I_t,t}. \qquad (4.1.1) \]

To compare to the best constant decision is a reasonable target since it is well-known
that (i) there exist randomized policies ensuring that $\mathbb{E}R_n/n$ tends to zero
as $n$ goes to infinity, (ii) this convergence property would not hold if the maximum
and the sum were inverted in the definition of $R_n$. This chapter will first
present my contributions to the stochastic bandit problems, essentially:

• how to use empirical variance estimates in upper confidence based policies?
(Section 4.2.4)

• how thin is the tail distribution of the regret of standard policies, and how
can we improve it? (Section 4.2.5)

• provide a minimax optimal policy (Section 4.2.6).

• propose a model and an arm-increasing rule to deal with bandit problems
with more arms than draws: $K > n$ (Section 4.2.7).

• design and use a Bernstein's bound with estimated variances to have better
stopping rules (Section 4.2.8).

• provide a policy to identify the best arm at the end of the $n$ time steps
(Section 4.2.9).

Sébastien Bubeck and I [12] contribute to the adversarial setting by designing a
new type of weighted average forecaster characterized by an implicit normalization
of the weights, and for which a new type of analysis can be developed. The
advantage of the policy and the analysis is that it allows us to bridge the long open
logarithmic gap in the characterization of the minimax rate for the multi-armed
bandit problem, and to have a common framework for addressing other sequential
prediction problems (full information, label efficient, tracking the best expert)
(Section 4.3).



4.2. The stochastic bandit problem 

4.2.1. Notation. Let $T_i(t)$ denote the number of times arm $i$ is chosen by
the policy during the first $t$ plays. Define $\mu_i = \int x \, \nu_i(dx)$ the expectation and
$V_i = \int (x - \mu_i)^2 \, \nu_i(dx)$ the variance of the distribution $\nu_i$ characterizing arm $i$. Let
$i^* \in \operatorname{argmax}_{i \in \{1, \dots, K\}} \mu_i$ denote an index of an optimal arm. The suboptimality of
an arm $i$ is measured by:

\[ \Delta_i = \max_{j=1,\dots,K} \mu_j - \mu_i = \mu_{i^*} - \mu_i. \]

Let $X_{i,t}$ be the $t$-th reward obtained from arm $i$ if $T_i(n) \ge t$, and for $t > T_i(n)$, let
$X_{i,t}$ be other independent realizations of $\nu_i$. For any $i \in \{1, \dots, K\}$ and $s \in \mathbb{N}$,
introduce $\bar X_{i,s}$ and $V_{i,s}$ the empirical mean and variance of $X_{i,1}, \dots, X_{i,s}$:

\[ \bar X_{i,s} = \frac{1}{s} \sum_{j=1}^{s} X_{i,j} \quad \text{and} \quad V_{i,s} = \frac{1}{s} \sum_{j=1}^{s} (X_{i,j} - \bar X_{i,s})^2. \]

4.2.2. Regret notion. Previous works in the stochastic bandit problem do
not use the regret defined by (4.1.1), which is a regret with respect to the best
constant decision, but a (pseudo-)regret that compares the reward of the policy to
the reward of an optimal arm in expectation, that is $i^* \in \operatorname{argmax}_{i \in \{1, \dots, K\}} \mu_i$:

\[ \bar R_n = \sum_{t=1}^{n} (g_{i^*,t} - g_{I_t,t}) \le R_n. \]

Results concerning this regret are easier to state, and we will follow hereafter
the trend of previous works to state the results in terms of $\bar R_n$. In this section,
we gather results showing how to go from an upper bound on $\bar R_n$ to an upper
bound on $R_n$. The following lemma shows that logarithmic regret bounds on $\mathbb{E}\bar R_n$
extend to logarithmic regret bounds on $\mathbb{E}R_n$ when the optimal arm is unique, that
is $\mu_i < \mu_{i^*}$ for any $i \ne i^*$. Besides, unlike known upper bounds for $\mathbb{E}\bar R_n$, the ones
on $\mathbb{E}R_n$ depend on the variance $V_{i^*}$ of the reward distribution of the optimal arm.
(When there are several optimal arms, it is the smallest variance of the optimal
arms distributions which appears in the expected regret bound.)

Lemma 19 (^M) For a given $\delta \ge 0$, let $I = \{i \in \{1, \dots, K\} : \Delta_i \le \delta\}$ be the
set of arms "$\delta$-close" to the optimal ones, and $J = \{1, \dots, K\} \setminus I$ the remaining
set of arms. In the stochastic bandit game, we have



^„ ^-r^ n\og\I\ \ ^ 1 / , Ox 



and also 



ERn-ERn < \ ^-^+> ^exp — — . 

In particular when there exists a unique arm i* such that Aj. = 0,we have 

and also for any t > 

_ X / (t + nAi)'^ \ 

- fl„ > t) < gexp [- „^^,^,^2V..+2V, + 2Wn + Am ) ' 

The uniqueness of the optimal arm is really needed to have logarithmic (in $n$)
bounds on the expected regret. This can be easily seen by considering a two-armed
bandit in which both reward distributions are identical (and non-degenerate). In
this case, the expected pseudo-regret is equal to zero while the expected regret will
be at least of order $\sqrt{n}$ for any forecaster. This reveals a fundamental difference
between the expected regret and the pseudo-regret.

Previous works on stochastic bandits use the expected pseudo-regret criterion
since it satisfies

\[ \mathbb{E}\bar R_n = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}T_i(n), \]

meaning that one has only to control the expected sampling times of suboptimal
arms to understand how the expected pseudo-regret behaves.
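
The decomposition above can be evaluated directly once the mean rewards and the expected numbers of pulls are given. The following sketch does just that; the function names, the mean rewards and the pull counts are hypothetical values chosen for illustration.

```python
def gaps(means):
    """Suboptimality gaps: best mean reward minus the mean reward of each arm."""
    best = max(means)
    return [best - m for m in means]


def expected_pseudo_regret(means, expected_pulls):
    """Sum over the arms of (gap of the arm) x (expected number of pulls)."""
    return sum(g * t for g, t in zip(gaps(means), expected_pulls))


# Hypothetical 3-armed problem over n = 100 rounds.
means = [0.5, 0.4, 0.1]
expected_pulls = [90.0, 8.0, 2.0]
regret = expected_pseudo_regret(means, expected_pulls)  # 0.1 * 8 + 0.4 * 2
```

Pulls of the optimal arm cost nothing, so only the sampling times of the suboptimal arms matter, which is why the analysis of a policy reduces to bounding those.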



4.2.3. INTRODUCTION TO UPPER CONFIDENCE BOUNDS POLICIES. Early papers
have studied stochastic bandit problems under Bayesian assumptions (e.g.,
Gittins [61]). On the contrary, Lai and Robbins [84] have considered a parametric
minimax framework. They have introduced an algorithm that follows what is now
called the "optimism in the face of uncertainty principle". At time $t \equiv k_t \pmod K$
with $k_t \in \{1, \dots, K\}$, their policy compares an upper confidence bound (UCB)
of the mean reward $\mu_{k_t}$ of arm $k_t$ to a reasonable target defined as the highest
empirical mean of "sufficiently" drawn arms. If the upper confidence bound exceeds
the target, arm $k_t$ is drawn, and otherwise, the arm defining the reasonable
target is drawn. Lai and Robbins proved that the expected regret of this policy
increases at most at a logarithmic rate with the number of trials and that the
algorithm achieves the smallest possible regret up to some sub-logarithmic additive
term (for the considered family of distributions). Agrawal [2] proposed computationally
easier UCB algorithms in a more general setting that also have logarithmic
expected regret (at the price of a higher numerical constant in the upper bound on
the regret). More recently, Auer, Cesa-Bianchi and Fischer [18] have proposed
even simpler policies achieving logarithmic regret uniformly over time rather than
just for a fixed number $n$ of rounds known in advance by the agent. Besides,
unlike previous works, they have provided non asymptotic bounds.
Upper confidence bounds policies can be described as follows. From time 1 to
$K$, draw each arm once. At time $t \ge K + 1$, draw the arm maximizing $B_{i,T_i(t-1),t}$,
where $B_{i,s,t}$ is a high probability bound on $\mu_i$ computed from the i.i.d. sample
$X_{i,1}, \dots, X_{i,s}$. The confidence level of this high probability bound might depend
on the current round $t$. For instance, the UCB1 policy of Auer, Cesa-Bianchi and
Fischer [18] uses

\[ B_{i,s,t} = \bar X_{i,s} + \sqrt{\frac{2 \log t}{s}}, \]

which is an upper bound on $\mu_i$ holding with probability at least $1 - t^{-4}$ according
to Hoeffding's inequality.
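
The generic description above translates into a short simulation. The sketch below assumes the UCB1 index (empirical mean plus $\sqrt{2 \log t / s}$) and Bernoulli arms; the function names, the reward model, the arm means and the seed are hypothetical choices made for the example.

```python
import math
import random


def ucb1_index(mean, s, t):
    """Empirical mean of an arm drawn s times, plus the UCB1 bonus at round t."""
    return mean + math.sqrt(2 * math.log(t) / s)


def run_ucb1(means, n, seed=0):
    """Play n rounds on Bernoulli arms and return the number of pulls per arm."""
    rng = random.Random(seed)
    num_arms = len(means)
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for t in range(1, n + 1):
        if t <= num_arms:
            arm = t - 1  # rounds 1..K: draw each arm once
        else:
            arm = max(
                range(num_arms),
                key=lambda i: ucb1_index(sums[i] / counts[i], counts[i], t),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


pulls = run_ucb1([0.9, 0.1], 200)
```

Since the bonus decreases with the number of draws $s$ and increases slowly with $t$, every arm keeps being tried from time to time, at a rate that shrinks as its empirical mean falls behind.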

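Auer, Cesa-Bianchi and Fischer [18] also noted that plugging an upper confidence
bound of the variance in the square root term performs empirically substantially
better than UCB1. Precisely, their experiments used

\[ B_{i,s,t} = \bar X_{i,s} + \sqrt{\min\Big(\frac{1}{4},\, V_{i,s} + \sqrt{\frac{2 \log t}{s}}\Big) \frac{\log t}{s}}. \qquad (4.2.1) \]
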
My first contribution to the multi-armed bandit problem was to provide a theoret- 
ical justification of these empirical findings, as described in the following section. 

4.2.4. UCB POLICY WITH VARIANCE ESTIMATES. Rémi Munos, Csaba Szepesvári
and I [15] have proposed the following slight modification of the arm
indexes given by (4.2.1):

\[ B_{i,s,t} = \bar X_{i,s} + \sqrt{\frac{2 \zeta V_{i,s} \log t}{s}} + \frac{3 \zeta \log t}{s}, \qquad (4.2.2) \]

with $\zeta > 1$. The associated policy achieves a logarithmic regret as UCB1 with
a constant factor potentially much smaller than the one of UCB1. Indeed, from
[18], UCB1 satisfies



(4.2.3) 



i:Ai>0 



whereas our algorithm, called UCB-V (V for variance), satisfies for ( > 1, 



with > a function of ( satisfying ci 2 < 10 and < C ^ J2t^ t + Cj for 
some numerical constant C > 0. We also proved that for specific distributions of 
the rewards, UCB-V with C < 1 suffers a polynomial expected (pseudo-)regret, 
that is Ei?„ > Cn'^ for some C > 0. The argument proving this later assertion 
also implies that using exactly the upper bound (|4.2.1I) can dramatically fail in 
some specific situation^ 
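
To make the role of the variance estimate concrete, the sketch below compares two indexes for a single arm. It assumes that the UCB-V index of [15] takes the form empirical mean $+ \sqrt{2 \zeta V \log t / s} + 3 \zeta \log t / s$, where $V$ is the empirical variance, and that the UCB1 index is the empirical mean plus $\sqrt{2 \log t / s}$. The function names and the numerical values are hypothetical.

```python
import math


def ucbv_index(mean, var, s, t, zeta=1.2):
    """Assumed UCB-V index: a variance-dependent term plus a 1/s correction."""
    log_t = math.log(t)
    return mean + math.sqrt(2 * zeta * var * log_t / s) + 3 * zeta * log_t / s


def ucb1_index(mean, s, t):
    """UCB1 index, which ignores the variance of the arm."""
    return mean + math.sqrt(2 * math.log(t) / s)


# An arm with (almost) deterministic rewards, drawn 100 times by round 1000.
low_variance = ucbv_index(0.5, 0.0, 100, 1000)
hoeffding = ucb1_index(0.5, 100, 1000)
```

For a low-variance arm the exploration bonus decays like $1/s$ rather than $1/\sqrt{s}$, so such an arm is abandoned (or trusted) much sooner, which is the source of the improved constant when variances are small.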

4.2.5. Deviation of the regret of UCB policies. In this section, we 
consider that there is a unique optimal arm i*. In (15], we show that the UCB-V 
policy defined by (14.2.21) satisfies 



for quantities C and C depending on K, (,ai, . . . , ctk, Ai, . . . , A^, but not on 
n. The "polynomial" rate in (14.2.51) is not due to the looseness of the bound. It 
can be shown that as soon as the essential infimum of the optimal arm's distri- 
bution jl = sup{f G M : z/j. ([0,f)) = 0} is smaller than the mean reward of 
the second best arm, the pseudo-regret admits a polynomial tail only: there ex- 
ists C > (depending on the distributions z/i, . . . , uk) such that for any C > 0, 
there exists uq > such that for any n > uq, P(-R„ > Clogn) > { c'ciogn ) ■ 



The regret concentration, although it improves as $\zeta$ grows, is thus pretty slow.
The slow concentration happens when the first draws of the optimal arm are
unlucky (yielding small rewards) in which case the optimal arm will not be
selected any more during roughly the first steps. As a result, the distribution of
the regret can be seen as a mixture of a peaky mode corresponding to situations in
which the optimal arm has a "normal" behaviour (with small variations due to the
suboptimal arms) and a very thick-tailed mode corresponding to the unlucky start
described above. Our theoretical study shows that the mass of this mode decays
only at a polynomial rate controlled by $\zeta$. Recall that the larger $\zeta$ is, the more all
arms are explored, the larger the bound on the expected regret is (see (4.2.4)). In
our experiments, this mode does appear (see Figure 4.1).

^For instance, when the optimal arm concentrates its rewards on 0 and 1 (Bernoulli distribution
with parameter 1/2), and when the other arms always provide a reward equal to 1/2 — the
expected regret is lower bounded by Cn^/^.

^An entirely analogous result holds for UCB1: using the variance estimates or not does not
change the form of the tail distribution of the regret.



Figure 4.1: Distribution of the pseudo-regret for UCB-V ($\zeta = 1$) for horizon
$n = 16{,}384$ (l.h.s. figure) and $n = 524{,}288$ (r.h.s. figure). The bandit problem
is defined by $K = 2$, a Bernoulli distribution with parameter 0.5 and a Dirac
distribution at 0.495.

When the time horizon $n$ is known, one may consider the UCB policy with

\[ B_{i,s,t} = \bar X_{i,s} + \sqrt{\frac{6 V_{i,s} \log n}{s}} + \frac{9 \log n}{s}, \qquad (4.2.6) \]

which is an upper bound on $\mu_i$ holding with high probability.
The associated UCB policy, called hereafter UCB-Horizon, concentrates its
exploration phase at the beginning of the plays, and then switches to the exploitation
mode. On the contrary, the UCB-V induced by (4.2.2), which looks deceptively
similar to UCB-Horizon (with $\zeta = 3$), explores and exploits at any time during
the interval $[1, n]$. Both policies have similar guarantees on their expected regret.
However, on the one hand, UCB-Horizon always satisfies

\[ \mathbb{P}(\bar R_n > C \log n) \le \frac{C'}{n}, \qquad (4.2.7) \]

where $C$ and $C'$ are quantities depending only on $K, \zeta, \sigma_1, \dots, \sigma_K, \Delta_1, \dots, \Delta_K$,
which contrasts with the significantly worse tail distribution of UCB-V. On the
other hand, unlike UCB-Horizon, UCB-V has the anytime property: the policy
satisfies the logarithmic expected regret bound for any time horizon $n$ (since its
pulling strategy does not depend on the time horizon). The open question here is
thus: could we have both properties? In other words, is there an algorithm that
does not need to know the time horizon and whose regret has a tail distribution
satisfying (4.2.7)? We conjecture that the answer is no.

4.2.6. Distribution-free optimal UCB policy. The inequalities (4.2.3)
and (4.2.4) may have surprised the reader since the right-hand sides diverge for $\Delta_i$
going to 0. For $\Delta_i = o(n^{-1/2})$, this is an artefact of the bounds, which is easily
rectifiable. For instance, for UCB1, the more general bound (but less readable
one) is

\[ \mathbb{E}\bar R_n \le \max_{t_i \ge 0,\ \sum_i t_i = n} \ \sum_{i : \Delta_i > 0} \min\Big(\frac{10}{\Delta_i} \log n,\ t_i \Delta_i\Big). \]

In the worst case (i.e., $\Delta_1 = 0$ and $\Delta_2 = \dots = \Delta_K = \sqrt{10 (K-1) (\log n)/n}$),
the right-hand side of the bound is equal to $\sqrt{10 n (K-1) \log n}$. This has to be
compared with the following lower bound of Auer, Cesa-Bianchi, Freund and
Schapire [19]:

\[ \inf \sup \mathbb{E}\bar R_n \ge \frac{1}{20} \sqrt{nK}, \]

where the infimum is taken over all policies and the supremum is taken over all
$K$-tuples of probability distributions on $[0, 1]$. We thus observe a logarithmic gap.
In [12], Sébastien Bubeck and I close this logarithmic gap, by using a different
UCB policy based on

\[ B_{i,s} = \bar X_{i,s} + \sqrt{\frac{\log \max\big(\frac{n}{Ks}, 1\big)}{s}}, \]

which, for $s \le n/K$, is an upper bound on $\mu_i$ which holds with probability at
least $1 - (Ks/n)^2$ according to Hoeffding's inequality. In this policy, an arm
that has been drawn more than $n/K$ times has an index equal to the empirical
mean of the rewards obtained from the arm, and when it has been drawn close to
$n/K$ times, the logarithmic term is much smaller than the one of UCB1, implying
less exploration of this already intensively drawn arm. For this policy, we prove

Theorem 20 For $\Delta = \min_{i \in \{1,\dots,K\}:\Delta_i>0} \Delta_i$, the above policy satisfies

\[ \mathbb{E}R_n \le \frac{23K}{\Delta}\log\Big(\max\Big(\frac{110\,n\Delta^2}{K},\ 10^4\Big)\Big) \tag{4.2.8} \]

and

\[ \mathbb{E}R_n \le 24\sqrt{nK}. \tag{4.2.9} \]

This means that this UCB policy has the minimax rate $\sqrt{nK}$, while still having
a distribution-dependent bound increasing logarithmically in $n$.
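
As an illustration, the index of this policy can be sketched in a few lines of Python. This is only a sketch under assumptions: the names `minimax_ucb_index` and `choose_arm` are hypothetical, the index is read as the empirical mean plus $\sqrt{\log\max(n/(Ks),1)/s}$, and giving untried arms an infinite index is a convention chosen here rather than something stated in the text.

```python
import math


def minimax_ucb_index(empirical_mean: float, pulls: int, horizon: int, num_arms: int) -> float:
    """Index of an arm drawn `pulls` times, for horizon n and K arms."""
    if pulls <= 0:
        # Assumed convention: an arm that was never drawn is tried first.
        return math.inf
    # log max(n / (K s), 1): the exploration bonus vanishes once s >= n / K.
    bonus = math.log(max(horizon / (num_arms * pulls), 1.0))
    return empirical_mean + math.sqrt(bonus / pulls)


def choose_arm(means, pulls, horizon):
    """Return the position of the arm with the largest index."""
    k = len(means)
    indices = [minimax_ucb_index(m, s, horizon, k) for m, s in zip(means, pulls)]
    return max(range(k), key=indices.__getitem__)
```

With $n = 100$ and $K = 10$, an arm drawn 10 times or more is ranked by its empirical mean alone, which is the behaviour described above for intensively drawn arms.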

4.2.7. UCB policy with an infinite number of arms. When the number
of arms is infinite (or larger than the available number of experiments), the
exploration of all the arms is impossible: if no additional assumption is made, it
may be arbitrarily hard to find a near-optimal arm. In [129], Yizao Wang, Remi
Munos and I consider a stochastic assumption on the mean-reward of any new
selected arm. When a new arm $i$ is pulled, its mean-reward $\mu_i$ is assumed to be an
independent sample from a fixed distribution. Our assumptions essentially
characterize the probability of pulling near-optimal arms. That is, given $\mu^* \in [0, 1]$ as
the best possible mean-reward and $\beta > 0$ a parameter of the mean-reward
distribution, the probability that a new arm is $\delta$-optimal is of order $\delta^{\beta}$ for small $\delta$, i.e.
$P(\mu_k \ge \mu^* - \delta) = \Theta(\delta^{\beta})$ for $\delta \to 0$. In contrast with the previous many-armed
bandits [30, 121], our setting allows general reward distributions for the arms,
under a simple assumption on the mean-reward.

When there are more arms than the available number of experiments, the
exploration takes two forms: discovery (pulling a new arm that has never been tried
before) and sampling (pulling an arm already discovered in order to gain
information about its actual mean-reward).

Numerous applications can be found e.g. in [30]. They include labor markets
(a worker has many opportunities for jobs), mining for valuable resources (such
as gold or oil) when there are many areas available for exploration (the miner
can move to another location or continue in the same location, depending on
results), and path planning under uncertainty, in which the path planner has to decide
among a route that has proved to be efficient in the past (exploitation), a known
route that has not been explored many times (sampling), or a brand new route that
has never been tried before (discovery).

In [129], we propose an arm-increasing rule policy. It has the anytime property
and consists in adding a new arm from time to time into the set of sampled arms. It
is done such that at time $t$, the number of sampled arms is of order $n^{\beta/2}$ if $\mu^* < 1$
and $\beta < 1$, and of order $n^{\beta/(\beta+1)}$ otherwise. It uses a modified version of the
UCB-V policy on this set of arms: specifically, the policy associated with

\[ B_{i,s,t} = \bar{X}_{i,s} + \sqrt{\frac{4 V_{i,s}\log(10\log t)}{s}} + \frac{6\log(10\log t)}{s}. \]
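
The two ingredients of this policy, the number of arms kept in play and the index used on them, can be sketched as follows. This is a sketch under assumptions, not the implementation of [129]: the function names are hypothetical, the "of order" rule is read at the current time $t$ with constant 1, and the index is taken to be the empirical mean plus $\sqrt{4V\log(10\log t)/s} + 6\log(10\log t)/s$.

```python
import math


def target_num_arms(t: int, beta: float, mu_star_is_one: bool) -> int:
    """Number of arms kept in the sampled set at time t (constant 1 assumed)."""
    if (not mu_star_is_one) and beta < 1:
        exponent = beta / 2
    else:
        exponent = beta / (beta + 1)
    return max(1, math.ceil(t ** exponent))


def arm_increasing_index(mean: float, variance: float, s: int, t: int) -> float:
    """Variance-aware index of an arm drawn s times, evaluated at time t >= 2."""
    if t < 2 or s < 1:
        raise ValueError("the index needs t >= 2 and s >= 1")
    exploration = math.log(10 * math.log(t))
    return mean + math.sqrt(4 * variance * exploration / s) + 6 * exploration / s
```

A new arm would be added whenever the number of arms already in play falls below `target_num_arms(t, beta, mu_star_is_one)`; otherwise the arm with the largest index is drawn.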

The pseudo-regret of this policy is still defined as the difference between the
rewards we would have obtained by drawing an optimal arm (an arm having a
mean-reward equal to $\mu^*$) and the rewards we did obtain during the time steps
$1, \dots, n$; hence, from the tower rule, $\mathbb{E}R_n = n\mu^* - \mathbb{E}\sum_{t=1}^n \mu_{I_t}$. Its behaviour
depends on whether $\mu^* = 1$ or $\mu^* < 1$. Let us write $v_n = \tilde{O}(u_n)$ when for some
$n_0, C > 0$, $v_n \le C u_n (\log(u_n))^2$ for all $n \ge n_0$. For $\mu^* = 1$, our algorithms are
such that $\mathbb{E}R_n = \tilde{O}(n^{\beta/(1+\beta)})$. For $\mu^* < 1$, we have $\mathbb{E}R_n = \tilde{O}(n^{\beta/(1+\beta)})$ if $\beta > 1$,
and (only) $\mathbb{E}R_n = \tilde{O}(n^{1/2})$ if $\beta \le 1$. Moreover we derive the lower bound: for
any $\beta > 0$, $\mu^* \le 1$, any algorithm satisfies $\mathbb{E}R_n \ge C n^{\beta/(1+\beta)}$ for some $C > 0$.

(Footnote on the $\Theta$ notation: we write $f(\delta) = \Theta(g(\delta))$ for $\delta \to 0$ when there exist
$c_1, c_2, \delta_0 > 0$ such that for all $\delta \le \delta_0$, $c_1 g(\delta) \le f(\delta) \le c_2 g(\delta)$.)

In continuum-armed bandits (see e.g. [1, 78, 20]), an infinity of arms is also
considered. The arms lie in some Euclidean (or metric) space and their
mean-reward is a deterministic and smooth (e.g. Lipschitz) function of the arms. This
setting is different from ours since our assumption is stochastic and does not
consider regularities of the mean-reward w.r.t. the arms. However, if we choose an
arm-pulling strategy which consists in selecting randomly the arms, then our
setting encompasses continuum-armed bandits. For example, consider the domain
$[0, 1]^d$ and a mean-reward function $\mu$ assumed to be locally equivalent to a Hölder
function (of order $\alpha \in [0, +\infty)$) around any maximum $x^*$ (the number of maxima
is assumed to be finite), i.e.

\[ \mu(x^*) - \mu(x) = \Theta(\|x^* - x\|^{\alpha}) \quad \text{when } x \to x^*. \tag{4.2.10} \]

Pulling randomly an arm $X$ according to the Lebesgue measure on $[0, 1]^d$, we
have: $P(\mu(X) > \mu^* - \varepsilon) = \Theta(P(\|X - x^*\|^{\alpha} < \varepsilon)) = \Theta(\varepsilon^{d/\alpha})$, for $\varepsilon \to 0$. Thus
our assumption holds with $\beta = d/\alpha$, and our results say that if $\mu^* = 1$, we have
$\mathbb{E}R_n = \tilde{O}(n^{\beta/(1+\beta)}) = \tilde{O}(n^{d/(\alpha+d)})$.
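
The step giving $\beta = d/\alpha$ can be spelled out as a short computation. It is a sketch under assumptions introduced here: a single maximum $x^*$ lying in the interior of the domain, and $c_d$ denoting the volume of the unit Euclidean ball of $\mathbb{R}^d$.

```latex
\begin{align*}
P\big(\|X - x^*\|^{\alpha} < \varepsilon\big)
  &= P\big(\|X - x^*\| < \varepsilon^{1/\alpha}\big) \\
  &= \mathrm{Leb}\big(\{x \in [0,1]^d : \|x - x^*\| < \varepsilon^{1/\alpha}\}\big) \\
  &= c_d\,\varepsilon^{d/\alpha}
  \qquad \text{for $\varepsilon$ small enough that the ball lies inside $[0,1]^d$.}
\end{align*}
```

When $x^*$ lies on the boundary, only a fixed fraction of the ball is inside the domain, which changes the constant but not the exponent $d/\alpha$.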

For $d = 1$, under the assumption that $\mu$ is $\alpha$-Hölder (i.e. $|\mu(x) - \mu(y)| \le
c\,\|x - y\|^{\alpha}$ for $0 < \alpha \le 1$), [78] provides upper and lower bounds on the
pseudo-regret $R_n = \Theta(n^{(\alpha+1)/(2\alpha+1)})$. Our results give $\mathbb{E}R_n = \tilde{O}(n^{1/(\alpha+1)})$, which is
better for all values of $\alpha$. The reason for this apparent contradiction is that the
lower bound in [78] is obtained by the construction of a very irregular function,
which actually does not satisfy our local assumption (4.2.10).

Now, under assumptions (4.2.10) for any $\alpha > 0$ (around a finite set of
maxima), [20] provides the rate $\mathbb{E}R_n = \tilde{O}(\sqrt{n})$. Our result gives the same rate when
$\mu^* < 1$ but in the case $\mu^* = 1$ we obtain the improved rate $\mathbb{E}R_n = \tilde{O}(n^{1/(\alpha+1)})$,
which is better whenever $\alpha > 1$ (because we are able to exploit the low variance
of the good arms). Note that like our algorithm, the algorithms in [20] as well as
in [78] do not make an explicit use (in the procedure) of the smoothness of the
function. They just use a "uniform" discretization of the domain.

On the other hand, the zooming algorithm of [75] adapts to the smoothness
of $\mu$ (more arms are sampled at areas where $\mu$ is high). For any dimension $d$,
they obtain $\mathbb{E}R_n = \tilde{O}(n^{(d'+1)/(d'+2)})$, where $d' \le d$ is their "zooming dimension".
Under assumptions (4.2.10) we deduce $d' = \frac{\alpha-1}{\alpha}d$ using the Euclidean distance as
metric, thus their pseudo-regret is $\mathbb{E}R_n = \tilde{O}(n^{(d(\alpha-1)+\alpha)/(d(\alpha-1)+2\alpha)})$. For locally
quadratic functions (i.e. $\alpha = 2$), their rate is $\tilde{O}(n^{(d+2)/(d+4)})$, whereas ours is
$\tilde{O}(n^{d/(2+d)})$. Again, we have a smaller pseudo-regret although we do not use the
smoothness of $\mu$ in our algorithm. Here the reason is that the zooming algorithm
does not make full use of the fact that the function is locally quadratic (it considers
a Lipschitz property only). However, in the case $\alpha < 1$, our rates are worse than
those of algorithms specifically designed for continuum-armed bandits.

Hence, the comparison between the many-armed and continuum-armed bandit
settings is not easy because of the difference in nature of the underlying
assumptions. Our setting is an alternative to the continuum-armed bandit setting which
does not require the existence of an underlying metric space in which the
mean-reward function would be smooth. Our assumption naturally deals with possibly
very complicated functions where maxima may be located in any part of the space.
For continuum-armed bandit problems in which there are relatively many
near-optimal arms, our algorithm will also be competitive with the specifically
designed continuum-armed bandit algorithms. This result matches the intuition
that in such cases, a random selection strategy will perform well.

Another contribution of our work is to show that, for infinitely many-armed
bandits, we need much less exploration of each arm than for finite-armed
bandits: as shown in the next section, the index $B_{i,s,t}$ is an upper bound on $\mu_i$ which
holds with probability at least $1 - [\log(10t)]^{-2}$. The use of this low confidence
upper bound (compared to the ones of UCB1 and UCB-V for instance) can be
explained by the fact that many sampled arms have a mean really close to the
optimal one, and consequently exploiting not the best one but just one of the best
arms is enough to achieve the minimax pseudo-regret.

4.2.8. The empirical Bernstein inequality. A key lemma to analyze
the policies using variance estimates, such as UCB-V and the one used in the previous
section, is the following maximal inequality, which in particular implies that the
arm index (4.2.2) of UCB-V is an upper bound on $\mu_i$ which holds with probability
at least 1 — 3t^'^. The interest of the lemma goes beyond the particular setting 
of the multi-armed bandit problems as it provides a non asymptotic confidence 
interval on the expectation of a distribution for which we observe a sample (and 
for which we know a bounded interval containing its support). 

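Lemma 21 Let $U, U_1, \dots, U_n$ be independent and identically distributed random
variables taking their values in $[0, 1]$. Let

\[ \bar{U}_t = \frac{1}{t}\sum_{i=1}^{t} U_i \]

and

\[ V_t = \frac{1}{t}\sum_{i=1}^{t} (U_i - \bar{U}_t)^2. \]

1. For any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, for any $t \in \{1, \dots, n\}$ and
$\ell_t = \frac{n\log(2\varepsilon^{-1})}{t^2}$, we have

\[ \bar{U}_t - \mathbb{E}U \le \min\Big(\sqrt{2\ell_t(V_t + \ell_t)} + \ell_t\big(\tfrac{1}{3} + \sqrt{1 - 3V_t}\big),\ \sqrt{\ell_t/2}\Big). \tag{4.2.11} \]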



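2. For any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, for any $t \in \{1, \dots, n\}$ and
$\ell_t = \frac{n\log(3\varepsilon^{-1})}{t^2}$, we have

\[ |\bar{U}_t - \mathbb{E}U| \le \min\Big(\sqrt{2\ell_t(V_t + \ell_t)} + \ell_t\big(\tfrac{1}{3} + \sqrt{1 - 3V_t}\big),\ \sqrt{\ell_t/2}\Big). \tag{4.2.12} \]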



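In particular, for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, for any $t \in \{1, \dots, n\}$,
we have

\[ |\bar{U}_t - \mathbb{E}U| \le \sqrt{\frac{2V_t\,n\log(3\varepsilon^{-1})}{t^2}} + \frac{3n\log(3\varepsilon^{-1})}{t^2}. \tag{4.2.13} \]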
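
To see how the empirical Bernstein radius behaves, here is a small sketch that evaluates it at $t = n$ and compares it with a Hoeffding radius. It assumes the reading $\sqrt{2V_n\log(3\varepsilon^{-1})/n} + 3\log(3\varepsilon^{-1})/n$ of the bound for $t = n$; the function names are hypothetical and the two-sided Hoeffding radius $\sqrt{\log(2\varepsilon^{-1})/(2n)}$ is the standard one for variables in $[0, 1]$.

```python
import math


def empirical_bernstein_radius(samples, eps):
    """Radius sqrt(2 V_n log(3/eps) / n) + 3 log(3/eps) / n for samples in [0, 1]."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((u - mean) ** 2 for u in samples) / n  # empirical variance V_n
    log_term = math.log(3.0 / eps)
    return math.sqrt(2.0 * variance * log_term / n) + 3.0 * log_term / n


def hoeffding_radius(n, eps):
    """Two-sided Hoeffding radius for n variables taking values in [0, 1]."""
    return math.sqrt(math.log(2.0 / eps) / (2.0 * n))
```

When the empirical variance is small the first term vanishes and the radius decreases like $1/n$ rather than $1/\sqrt{n}$, which is the regime in which the variance-aware bound pays off.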

Inequality (4.2.13) is the one used in [14, 109], but its tighter version (4.2.12)
should be preferred. The proof of this lemma is given in Appendix E. For $t = n$,
the lemma is an empirical version of Bernstein's inequality, which differs from the
latter to the following extent: the true variance has been replaced by its empirical
estimate, at the price of having $\log(3\varepsilon^{-1})$ terms instead of $\log(\varepsilon^{-1})$, and a factor 3
in the last term in the right-hand side instead of 1/3. Inequality (4.2.13) relies on
the following empirical upper bound of the variance $V$ of $U$, which simultaneously
holds with probability at least $1 - \varepsilon$: for any $t \in \{1, \dots, n\}$, we have



This bound can be seen as an improvement of Inequality (5.27) of Blanchard [33].
For $t = n \ge 2$, i.e. without the stopping time argument due to Freedman [57]
allowing to have the inequality uniformly over time, Maurer and Pontil [101]
improve on the constants of the above inequality when the empirical variance is
close to 0. Considering the unbiased variance estimator $V'_t = \frac{1}{t-1}\sum_{s=1}^{t}(U_s - \bar{U}_t)^2
= \frac{t}{t-1}V_t$, they obtain that with probability at least $1 - \varepsilon$,




2{t- 1) V 2{t-l] 
Combined with Bernstein's bound, this gives that with probability at least $1 - \varepsilon$,



2(t-l) J 3(t-l) ' 

where the gain is on the factor of the logarithmic term when the empirical variance
is much smaller than $\log(3\varepsilon^{-1})/t$.

Volodymyr Mnih, Csaba Szepesvari and I [109] have used Lemma 21 to address
the problem of stopping the sampling of an unknown distribution $\nu$ as soon
as we can output an estimate $\hat{\mu}$ of the mean $\mu$ of $\nu$ with relative error $\delta$ with
probability at least $1 - \varepsilon$, that is

\[ P\big(|\hat{\mu} - \mu| \le \delta|\mu|\big) \ge 1 - \varepsilon. \tag{4.2.14} \]

For a distribution $\nu$ supported by $[a, a + 1]$ for some $a \in \mathbb{R}$, we have proposed
the empirical Bernstein stopping algorithm described in Figure 4.2. It uses a
geometric grid and parameters ensuring that the event $\mathcal{E} = \{|\bar{U}_t - \mu| \le c_t,\ t \ge t_1\}$
occurs with probability at least $1 - \varepsilon$. It operates by maintaining a lower bound,
LB, and an upper bound, UB, on the absolute value of the mean of the random
variable being sampled, terminates when $(1 + \delta)\mathrm{LB} \ge (1 - \delta)\mathrm{UB}$, and returns
the mean estimate $\hat{\mu} = \mathrm{sign}(\bar{U}_t)\,\frac{(1+\delta)\mathrm{LB} + (1-\delta)\mathrm{UB}}{2}$. We prove that this output indeed
satisfies (4.2.14) and that the stopping time $T$ of the algorithm is upper bounded

by

^ - ^ ■ ^r) d"^ (?) ^ '"^ ( m 

Up to the log log term, this is optimal according to the work of Dagum, Karp,
Luby and Ross [49].
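
The stopping rule can be illustrated with a simplified sketch. It is not the algorithm of Figure 4.2: it drops the geometric grid, applies the empirical Bernstein radius $\sqrt{2V_t\log(3/d_t)/t} + 3\log(3/d_t)/t$ at every step, and spends the confidence budget as $d_t = c/t^p$ with $c = \varepsilon(p-1)/p$, a choice made here so that $\sum_t d_t \le \varepsilon$. The name `eb_stop`, the parameter `p` and the safety cap `max_samples` are hypothetical.

```python
import math


def eb_stop(sampler, delta, eps, p=1.1, max_samples=10**6):
    """Sample until the mean is known to relative error delta (simplified rule).

    `sampler` is a callable returning one observation from a distribution
    supported by an interval of length 1. Returns (estimate, number of samples).
    """
    c = eps * (p - 1.0) / p          # then sum_t c / t**p <= eps
    lb, ub = 0.0, math.inf           # bounds on the absolute value of the mean
    total = total_sq = 0.0
    for t in range(1, max_samples + 1):
        u = sampler()
        total += u
        total_sq += u * u
        mean = total / t
        variance = max(total_sq / t - mean * mean, 0.0)
        log_term = math.log(3.0 * t**p / c)          # log(3 / d_t)
        radius = math.sqrt(2.0 * variance * log_term / t) + 3.0 * log_term / t
        lb = max(lb, abs(mean) - radius)
        ub = min(ub, abs(mean) + radius)
        if (1.0 + delta) * lb >= (1.0 - delta) * ub:
            sign = 1.0 if mean >= 0 else -1.0
            return sign * 0.5 * ((1.0 + delta) * lb + (1.0 - delta) * ub), t
    raise RuntimeError("no decision within max_samples (is the mean close to 0?)")
```

A call such as `eb_stop(lambda: 0.5, delta=0.1, eps=0.05)` stops after a number of samples driven by the $3\log(3/d_t)/t$ term only, since the empirical variance is zero; a mean equal to 0 can never be estimated to a given relative error, hence the cap.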

Besides, our experimental simulations show that it significantly outperforms
previously known stopping rules, in particular AA [49] and the Nonmonotonic
Adaptive Sampling (NAS) algorithm due to Domingo, Gavalda and Watanabe
[130, 54]. Figure 4.3 shows the results of running different stopping rules for the
distribution $\nu$ of the average of 10 uniform random variables on $[\mu - 1/2, \mu + 1/2]$
with varying $\mu$ and also on Bernoulli distributions. The experiment is repeated
a hundred times so that the differences observed in Figure 4.3 are statistically
significant.

We also use the empirical Bernstein bound in the context of racing algorithms.
Racing algorithms aim to reduce the computational burden of performing tasks
such as model selection using a hold-out set by discarding poor models quickly
[98, 112]. The context of racing algorithms is the one of multi-armed bandit
problems. Let $\varepsilon > 0$ be the confidence level parameter. A racing algorithm either
terminates when it runs out of time (i.e. at the end of the $n$-th round) or when it
Parameters of the problem: $\delta$, $\varepsilon$ and the unknown distribution $\nu$.

Parameters of the algorithm: $q > 0$, $t_1 \ge 1$ and $\alpha > 1$ defining the geometric grid
$t_k = \lceil \alpha t_{k-1} \rceil$. (In our simulations, we take $q = 0.1$, $t_1 = 20$ and $\alpha = 1.1$.)

Initialization:
    $c = \frac{3}{\varepsilon\, t_1^{q}(1 - \alpha^{-q})}$
    LB $\leftarrow 0$
    UB $\leftarrow \infty$

For $t = 1, \dots, t_1 - 1$,
    sample $U_t$ from $\nu$
End For

For $k = 1, 2, \dots$,
    For $t = t_k, \dots, t_{k+1} - 1$,
        sample $U_t$ from $\nu$ and compute the empirical mean $\bar{U}_t = \frac{1}{t}\sum_{s=1}^{t} U_s$
        $c_t = \min\big(\sqrt{2\ell_t(V_t + \ell_t)} + \ell_t(\tfrac{1}{3} + \sqrt{1 - 3V_t}),\ \sqrt{\ell_t/2}\big)$
        LB $\leftarrow \max(\mathrm{LB}, |\bar{U}_t| - c_t)$
        UB $\leftarrow \min(\mathrm{UB}, |\bar{U}_t| + c_t)$
        If $(1 + \delta)\mathrm{LB} \ge (1 - \delta)\mathrm{UB}$, Then
            stop simulating $U$ and return the mean estimate $\mathrm{sign}(\bar{U}_t)\,\frac{(1+\delta)\mathrm{LB} + (1-\delta)\mathrm{UB}}{2}$
        End If
    End For
End For

Figure 4.2: Empirical Bernstein stopping (EBGStop* in our experiments).



can say that with probability at least $1 - \varepsilon$, it has found the best option, i.e. an
option $i^* \in \mathrm{argmax}_{i \in \{1, \dots, K\}}\, \mu_i$.

The Hoeffding race introduced by [98] is an algorithm based on discarding
options which are likely to have smaller mean than the optimal one until only one
option remains. Precisely, for each time step and each distribution, $\varepsilon$-confidence
intervals are constructed for the mean. Options with upper confidence bound smaller than
the lower confidence bound of another option are discarded. The algorithm samples
one by one all the options that have not been discarded yet. Our empirical and
theoretical studies show that replacing Hoeffding's inequality by the empirical
Bernstein bound leads to significant improvement. In particular, Table 4.1 shows
the percentage of work saved by each method (1 minus the number of samples taken by
the method divided by $Kn$), as well as the number of options remaining after
termination (see [109] for a more detailed description of the experiments).
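
The elimination step shared by both races can be sketched as follows. The function name `surviving_options` is hypothetical, and the confidence radii are passed in as plain numbers so that either a Hoeffding or an empirical Bernstein radius can be plugged in; this is an illustration of the discarding rule described above, not the experimental code of [109].

```python
def surviving_options(means, radii):
    """Indices of the options kept after one elimination step.

    An option is discarded when its upper confidence bound (mean + radius)
    is smaller than the lower confidence bound (mean - radius) of another option.
    """
    best_lower = max(m - r for m, r in zip(means, radii))
    return [i for i, (m, r) in enumerate(zip(means, radii)) if m + r >= best_lower]
```

Tighter radii discard options sooner, which is where a variance-aware bound can save samples when some options have low variance.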
[Figure 4.3: plot not reproduced. Legend: EBGStop*, EBGStop, EBStop, AA, NAS, Geo NAS.]

Figure 4.3: Comparison of stopping rules on (l.h.s. figure) averages of uniform
random variables with varying means and (r.h.s. figure) Bernoulli random variables
with means 0.99, 0.9, 0.5, 0.1, 0.05, and 0.01, averaged over 100 runs. The
average number of samples is shown in log scale.

Table 4.1: Percentage of work saved / number of options left after termination.



Data set      Hoeffding     Empirical Bernstein
SARCOS        0.0% / 11     44.9% / 4
Covertype2    14.9% / 8     29.3% / 5
Local         6.0% / 9      33.1% / 6



4.2.9. Best arm identification. Racing algorithms [98] try to identify the 
best action at a given confidence level while consuming the minimal number of 
pulls. They essentially try to optimize the exploration "budget" for a given con- 
fidence level. In some applications, the budget size is fixed (say n rounds), and 
one may want to predict the best arm at the end of the n-th round. A motivating 
example concerns channel allocation for mobile phone communications. During 
a very short time before the communication starts, a cellphone can explore the set 
of channels to find the best one to operate. Each evaluation of a channel is noisy 
and there is a limited number of evaluations before the communication starts. The 
connection is then launched on the channel which is believed to be the best. 

More formally, the setting of identifying the best arm is summarized in
Figure 4.4. It differs from the traditional multi-armed bandit problem by its target:
the cumulative regret is no longer appropriate to measure the performance of a
policy. The aim is rather to minimize the simple regret:

\[ r_n = \Delta_{J_n}, \]

where $J_n$ is the final recommendation of the algorithm and $\Delta_i$ still denotes the gap
between the mean reward of the best arm and the mean reward of arm $i$.
Let $i^*$ still denote the optimal arm. The simple regret is linked to the probability
Parameters available to the forecaster: the number of rounds $n$ and the number of
arms $K$.

Parameters unknown to the forecaster: the reward distributions $\nu_1, \dots, \nu_K$ of the
arms.

For each round $t = 1, 2, \dots, n$;

(1) the forecaster chooses $I_t \in \{1, \dots, K\}$,

(2) the environment draws the reward from $\nu_{I_t}$, independently of the
past given $I_t$.

At the end of the $n$ rounds, the forecaster outputs a recommendation $J_n \in
\{1, \dots, K\}$.

Figure 4.4: Best arm identification in multi-armed bandits.

of error

\[ e_n = P(J_n \ne i^*), \]

since, from $\mathbb{E}r_n = \sum_{i \ne i^*} \Delta_i\,P(J_n = i)$, we have $\min_{i:\Delta_i>0}\Delta_i\, e_n \le \mathbb{E}r_n \le e_n$.
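
A tiny numerical check makes the link between the two criteria concrete. The sketch below is illustrative: the function names are hypothetical, the optimal arm is assumed to be the only arm with zero gap, and the gaps are assumed to be at most 1 (rewards in $[0, 1]$), which is what the upper bound uses.

```python
def expected_simple_regret(gaps, rec_probs):
    """E r_n = sum_i P(J_n = i) * Delta_i."""
    return sum(p * g for p, g in zip(rec_probs, gaps))


def error_probability(gaps, rec_probs):
    """e_n = P(J_n != i*), with i* the unique arm whose gap is zero."""
    return sum(p for p, g in zip(rec_probs, gaps) if g > 0)


def sandwich(gaps, rec_probs):
    """Return (min positive gap * e_n, E r_n, e_n), which should be nondecreasing."""
    e_n = error_probability(gaps, rec_probs)
    smallest_gap = min(g for g in gaps if g > 0)
    return smallest_gap * e_n, expected_simple_regret(gaps, rec_probs), e_n
```

For gaps `[0.0, 0.1, 0.3]` and recommendation probabilities `[0.7, 0.2, 0.1]`, the three quantities are about 0.03, 0.05 and 0.3.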

In [13], Sebastien Bubeck, Remi Munos and I prove that UCB policies can
still be used provided that the exploration term is taken much larger: precisely, for
$H = \sum_{i:\Delta_i>0} \Delta_i^{-2}$ and a numerical constant $c > 0$, we introduce the UCB-E (E
for exploration) policy characterized by

\[ B_{i,s,t} = \bar{X}_{i,s} + \sqrt{\frac{c\,n/H}{s}}, \]

which is an extremely high confidence upper bound on $\mu_i$ (probability at least
$1 - \exp(-2cn/H)$, hence much higher than the confidence level of UCB1 and UCB-V),
and by taking $J_n$ as the arm with the largest empirical mean. We also propose
a new algorithm, called SR, based on successive rejects. We show that these
algorithms are essentially optimal since their simple regret decreases exponentially at
a rate which is, up to a logarithmic factor, the best possible. However, while the
UCB policy needs the tuning of a parameter depending on the unobservable
hardness of the task, the successive rejects policy benefits from being parameter-free,
and also independent of the scaling of the rewards. As a by-product of our
analysis, we show that identifying the best arm (when it is unique) requires a number
of samples of order $H$ (up to a $\log(K)$ factor). This generalizes the well-known
[Figure 4.5: plots not reproduced. Panel titles: "Experiment 7, n=6000" and "Experiment 7, n=12000"; horizontal axis: algorithms numbered 1 to 14.]

Figure 4.5: Probability of error of different algorithms for n = 6000 (l.h.s.) and 
n = 12000 (r.h.s.), and K = 30 arms having Bernoulli distributions with param- 
eters 0.5 (one arm), 0.45 (five arms), 0.43 (fourteen arms), 0.38 (ten arms). Each 
bar represents a different algorithm and the bar's height represents the probability 
of error of this algorithm. "Unif" is the uniform sampling strategy, "HR" is the 
Hoeffding Race algorithm (run for three different values of the confidence level 
parameter), UCB-E is tested for four different values of c: 2, 4, 8, 16, Adaptive 
UCB-E is tested for five different values of its parameter. More extensive exper- 
iments are presented in [13] and confirm the ranking of algorithms observed on 
these simulations: Ad UCB-E > SR > HR > Unif, where '>' means 'has better 
performance than'. (UCB-E is not ranked as it requires the knowledge of H.) 

fact that one needs of order of $1/\Delta^2$ samples to differentiate the means of two
distributions with gap $\Delta$. A precise understanding of both SR and the UCB-E policy
leads us to define a new algorithm, Adaptive UCB-E. It comes without guarantee
of optimal rates, but performs slightly better than SR in practice as shown in
Figure 4.5.
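
The hardness quantity and the UCB-E index are simple enough to write down. The sketch below assumes that $H$ is the sum of the inverse squared positive gaps and that the exploration term is $\sqrt{cn/(Hs)}$; the function names are hypothetical, and in practice $H$ is unknown, which is precisely the tuning difficulty mentioned above.

```python
import math


def hardness(gaps):
    """H = sum over suboptimal arms of 1 / Delta_i**2."""
    return sum(1.0 / (g * g) for g in gaps if g > 0)


def ucb_e_index(empirical_mean, pulls, c, n, h):
    """Index of an arm drawn `pulls` times, with exploration term sqrt(c n / (H s))."""
    return empirical_mean + math.sqrt(c * n / (h * pulls))
```

For two suboptimal arms with gap 0.5, `hardness` returns 8, in line with the $1/\Delta^2$ samples needed to separate two means with gap $\Delta$.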

Another variant of the best arm identification task is the problem of minimal
sampling times required to identify an $\varepsilon$-optimal arm with a given confidence
level, see in particular [54] and [56]. In [62], Steffen Grünewälder, Manfred Opper,
John Shawe-Taylor and I also study a non-cumulative regret notion, but in
the context of a continuum of arms. Precisely, we consider the scenario in which
the reward distribution for arms is modelled by a Gaussian process and there is no
noise in the observed reward, and provide upper and lower bounds under
reasonable assumptions about the covariance function defining the Gaussian process.



4.3. Sequential prediction

This section summarizes my work with Sebastien Bubeck [11]. It starts with
the adversarial bandit problem, and goes on with the extension to other sequential
prediction games.

4.3.1. Adversarial bandit. In the general bandit problem, the environment
is not constrained to generate the reward vectors independently as in the stochastic
bandit problem. However, the target is still to minimize the regret

\[ R_n = \max_{i=1,\dots,K} \sum_{t=1}^{n} \big(g_{i,t} - g_{I_t,t}\big). \]

In the most general form of the game, called the non-oblivious/adaptive
adversarial game, the adversary may choose the reward vector $g_t$ as a function of the
past decisions $I_1, \dots, I_{t-1}$. Upper bounds on the regret $R_n$ for this type of
adversary have a less straightforward interpretation since the target cumulative reward
now depends on the agent's policy! I will not provide here results for this
type of adversary but the extension of the results presented hereafter can be found
in [11].

Thus we will focus on the oblivious adversarial bandit game, in which the
reward vector $g_t$ is not a function of the past decisions $I_1, \dots, I_{t-1}$. The
environment is then simply defined by a distribution on $[0, 1]^{nK}$, while the agent's policy
is still defined by a mapping, denoted $\varphi$, from $\cup_{t \in \{1,\dots,n-1\}} (\{1, \dots, K\} \times [0, 1])^t$
to the set of distributions on $\{1, \dots, K\}$. Now we can see the game a bit
differently. The "master" of the game draws a matrix $(g_{i,t})_{1 \le i \le K,\, 1 \le t \le n}$ from the
distribution defining the environment, and at each time step $t$, draws the arm
$I_t$ according to the distribution $p_t = \varphi(\mathcal{H}_t)$ chosen by the agent, where $\mathcal{H}_t =
\{(I_1, g_{I_1,1}), \dots, (I_{t-1}, g_{I_{t-1},t-1})\}$ is the past information. The regret $R_n$ is a
random variable since it depends on the draw of the reward matrix and the draws
from the distributions $p_t$.

In [19], Auer, Cesa-Bianchi, Freund and Schapire have shown that a forecaster
based on exponentially weighted averages has a regret upper bounded by
$2.7\sqrt{nK\log K}$. As stated before, they also show that this is optimal up to the
logarithmic factor: precisely, there is no forecaster satisfying $\mathbb{E}R_n < \frac{1}{20}\sqrt{nK}$
for any environment. In [11], we close the logarithmic gap between the above
upper and lower bounds by introducing a new class of randomized policies. To
define it, consider a function $\psi : \mathbb{R}_-^* \to \mathbb{R}_+^*$ such that

    $\psi$ increasing and continuously differentiable,
    $\psi'/\psi$ nondecreasing,                                        (4.3.1)
    $\lim_{u \to -\infty} \psi(u) < 1/K$, and $\lim_{u \to 0} \psi(u) \ge 1$.

It can be easily shown that there exists a continuously differentiable function
$C : \mathbb{R}_+^K \to \mathbb{R}$ satisfying for any $x = (x_1, \dots, x_K) \in \mathbb{R}_+^K$,

\[ \max_{i=1,\dots,K} x_i < C(x) \le \max_{i=1,\dots,K} x_i - \psi^{-1}(1/K), \tag{4.3.2} \]
Parameter: function $\psi : \mathbb{R}_-^* \to \mathbb{R}_+^*$ satisfying (4.3.1).
Let $p_1$ be the uniform distribution over $\{1, \dots, K\}$.

For each round $t = 1, 2, \dots$,

(1) Draw an arm $I_t$ from the probability distribution $p_t$.

(2) Compute the estimated gain for each arm: $\tilde{g}_{i,t} = \frac{g_{i,t}}{p_{i,t}} 1_{I_t = i}$ and update the
estimated cumulative gain: $\tilde{G}_{i,t} = \sum_{s=1}^{t} \tilde{g}_{i,s}$. (a)

(3) Compute the normalization constant $C_t = C(\tilde{G}_t)$ where $\tilde{G}_t = (\tilde{G}_{1,t}, \dots, \tilde{G}_{K,t})$.

(4) Compute the new probability distribution $p_{t+1} = (p_{1,t+1}, \dots, p_{K,t+1})$ where
$p_{i,t+1} = \psi(\tilde{G}_{i,t} - C_t)$.

(a) It estimates $g_{i,t}$ even when $g_{i,t}$ is not observed since $\mathbb{E}_{I_t \sim p_t}\tilde{g}_{i,t} = p_{i,t}\cdot\frac{g_{i,t}}{p_{i,t}} = g_{i,t}$.

Figure 4.6: INF (Implicitly Normalized Forecaster) for the adversarial bandit.



and

\[ \sum_{i=1}^{K} \psi(x_i - C(x)) = 1. \tag{4.3.3} \]

So we can define the implicitly normalized forecaster (INF) as detailed in
Figure 4.6. Indeed, Equality (4.3.3) makes the fourth step in Figure 4.6 legitimate. From
(4.3.2), $C(\tilde{G}_t)$ is roughly equal to $\max_{i=1,\dots,K} \tilde{G}_{i,t}$. This means that INF chooses
the probability assigned to arm $i$ as a function of the (estimated) regret. In spirit,
it is similar to the traditional weighted average forecaster, see e.g. Section 2.1
of [46], where the probabilities are proportional to a function of the difference
between the (estimated) cumulative reward of arm $i$ and the cumulative reward
of the policy, which should be, for a well-performing policy, of order $C(\tilde{G}_t)$.
Weighted average forecasters and implicitly normalized forecasters are in fact two
different classes of forecasters whose intersection contains exponentially weighted
average forecasters such as the one considered in [19]. The interesting feature of
the implicit normalization is the following argument, which allows us to recover the
result of [19] and, more interestingly, to propose a policy having a regret of order
$\sqrt{nK}$. It starts with an Abel transformation and consequently is "orthogonal"
to the usual argument which, for the sake of comparison, has been reproduced in
Appendix E. Let $\tilde{G}_0 = 0 \in \mathbb{R}_+^K$. We have

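\begin{align*}
\sum_{t=1}^{n} g_{I_t,t} &= \sum_{t=1}^{n}\sum_{i=1}^{K} p_{i,t}\,\tilde{g}_{i,t} \\
&= \sum_{t=1}^{n}\sum_{i=1}^{K} p_{i,t}\big(\tilde{G}_{i,t} - \tilde{G}_{i,t-1}\big) \\
&= \sum_{i=1}^{K} p_{i,n+1}\tilde{G}_{i,n} + \sum_{i=1}^{K}\sum_{t=1}^{n} \tilde{G}_{i,t}\big(p_{i,t} - p_{i,t+1}\big) \\
&= \sum_{i=1}^{K} p_{i,n+1}\big(\psi^{-1}(p_{i,n+1}) + C_n\big) + \sum_{i=1}^{K}\sum_{t=1}^{n} \big(\psi^{-1}(p_{i,t+1}) + C_t\big)\big(p_{i,t} - p_{i,t+1}\big) \\
&= C_n + \sum_{i=1}^{K} p_{i,n+1}\psi^{-1}(p_{i,n+1}) + \sum_{i=1}^{K}\sum_{t=1}^{n} \psi^{-1}(p_{i,t+1})\big(p_{i,t} - p_{i,t+1}\big), \tag{4.3.4}
\end{align*}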

where the remarkable simplification in the last step is closely linked to our specific
class of randomized algorithms. The equality is interesting since, from (4.3.2), $C_n$
approximates the maximum estimated cumulative reward $\max_{i=1,\dots,K} \tilde{G}_{i,n}$, which
should be close to the cumulative reward of the optimal arm $\max_{i=1,\dots,K} G_{i,n}$,
where $G_{i,n} = \sum_{t=1}^{n} g_{i,t}$. Besides, the last term in the right-hand side is roughly
equal to

\[ -\sum_{i=1}^{K}\sum_{t=1}^{n} \int_{p_{i,t}}^{p_{i,t+1}} \psi^{-1}(u)\,du = -\sum_{i=1}^{K} \int_{1/K}^{p_{i,n+1}} \psi^{-1}(u)\,du. \]
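
The identity obtained from the Abel transformation can be checked numerically. The sketch below runs INF with the particular choice $\psi(x) = \exp(\eta x)$, for which the normalization has the closed form $C(x) = \eta^{-1}\log\sum_i \exp(\eta x_i)$, and compares the cumulative gain of the forecaster with $C_n + \sum_i p_{i,n+1}\psi^{-1}(p_{i,n+1}) + \sum_i\sum_t \psi^{-1}(p_{i,t+1})(p_{i,t} - p_{i,t+1})$. The function name, the choice of $\psi$ and the value of $\eta$ are assumptions made for the illustration.

```python
import math
import random


def run_inf_exponential(gains, eta, rng):
    """Run INF with psi(x) = exp(eta * x) on a fixed gain matrix.

    gains[t][i] is the gain of arm i at round t, in [0, 1].
    Returns (cumulative gain of the forecaster, right-hand side of the identity).
    """
    k = len(gains[0])
    p = [1.0 / k] * k            # p_1: uniform distribution
    est = [0.0] * k              # estimated cumulative gains
    reward = 0.0
    abel_sum = 0.0               # sum over i, t of psi^{-1}(p_{i,t+1}) (p_{i,t} - p_{i,t+1})
    c_t = 0.0
    for g_t in gains:
        arm = rng.choices(range(k), weights=p)[0]
        reward += g_t[arm]
        est[arm] += g_t[arm] / p[arm]      # importance-weighted (unbiased) estimate
        # For this psi, sum_i psi(x_i - C) = 1 gives C = log(sum_i exp(eta x_i)) / eta.
        top = max(est)
        c_t = top + math.log(sum(math.exp(eta * (x - top)) for x in est)) / eta
        new_p = [math.exp(eta * (x - c_t)) for x in est]
        # psi^{-1}(q) = log(q) / eta
        abel_sum += sum(math.log(q) / eta * (old - q) for old, q in zip(p, new_p))
        p = new_p
    rhs = c_t + sum(q * math.log(q) / eta for q in p) + abel_sum
    return reward, rhs
```

Both returned values agree up to floating-point error whatever the gains and the random draws, since the identity is purely algebraic.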

To make this precise, we use a Taylor-Lagrange expansion and technical arguments
to control the residual terms. Putting this together, we roughly have

\[ \max_{i=1,\dots,K} G_{i,n} - \sum_{t=1}^{n} g_{I_t,t} \lesssim -\sum_{i=1}^{K} p_{i,n+1}\psi^{-1}(p_{i,n+1}) + \sum_{i=1}^{K} \int_{1/K}^{p_{i,n+1}} \psi^{-1}(u)\,du. \]



The right-hand side is easy to study: it depends only on the final probability vector
and has simple upper bounds for adequate choices of $\psi$. For instance, for $\psi(x) =
\exp(\eta x) + \frac{\gamma}{K}$ with $\eta > 0$ and $\gamma \in [0, 1)$, the right-hand side is smaller than
$\frac{1-\gamma}{\eta}\log\big(\frac{K}{1-\gamma}\big) + \gamma C_n$. For $\psi(x) = \big(\frac{\eta}{-x}\big)^q + \frac{\gamma}{K}$ with $\eta > 0$, $q > 1$ and $\gamma \in [0, 1)$,
it is smaller than $\frac{q}{q-1}\eta K^{1/q} + \gamma C_n$. For the sake of simplicity, we have been hiding
the residual terms, but these terms when added together ($nK$ terms!) are not that
small, and in fact constrain the choice of the parameters $\gamma$ and $\eta$ if one wishes to
get the tightest bound. Our main result is the following.
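
For the exponential choice the computation can be carried out by hand. The following is a worked sketch in the special case $\gamma = 0$ (chosen here to keep the algebra short), writing $p_i$ for $p_{i,n+1}$ and evaluating the quantity $-\sum_i p_i\psi^{-1}(p_i) + \sum_i \int_{1/K}^{p_i}\psi^{-1}(u)\,du$ with $\psi^{-1}(u) = \eta^{-1}\log u$.

```latex
\begin{align*}
-\sum_{i=1}^{K} p_i\,\psi^{-1}(p_i) + \sum_{i=1}^{K}\int_{1/K}^{p_i}\psi^{-1}(u)\,du
 &= -\frac{1}{\eta}\sum_{i=1}^{K} p_i\log p_i
    + \frac{1}{\eta}\sum_{i=1}^{K}\Big[u\log u - u\Big]_{1/K}^{p_i} \\
 &= -\frac{1}{\eta}\sum_{i=1}^{K} p_i\log p_i
    + \frac{1}{\eta}\sum_{i=1}^{K}\Big(p_i\log p_i - p_i + \frac{\log K}{K} + \frac{1}{K}\Big) \\
 &= \frac{1}{\eta}\big(-1 + \log K + 1\big) = \frac{\log K}{\eta}.
\end{align*}
```

This matches the value $\frac{1-\gamma}{\eta}\log\big(\frac{K}{1-\gamma}\big) + \gamma C_n$ at $\gamma = 0$, and in that case the quantity does not depend on the final probability vector at all.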
Parameters: the number of arms (or actions) $K$ and the number of rounds $n$ with
$n \ge K \ge 2$.

For each round $t = 1, 2, \dots, n$

(1) The forecaster chooses an arm $I_t \in \{1, \dots, K\}$, possibly with the help of an
external randomization.

(2) Simultaneously the adversary chooses the reward vector
$g_t = (g_{1,t}, \dots, g_{K,t}) \in [0, 1]^K$.

(3) The forecaster receives the gain $g_{I_t,t}$ (without systematically observing it). He
observes

- the reward vector $(g_{1,t}, \dots, g_{K,t})$ in the full information game,

- the reward vector $(g_{1,t}, \dots, g_{K,t})$ if he asks for it with the global constraint
that he is not allowed to ask it more than $m$ times for some fixed
integer number $1 \le m \le n$. This prediction game is the label efficient
game,

- only $g_{I_t,t}$ in the bandit game,

- only his obtained reward $g_{I_t,t}$ if he asks for it with the global constraint
that he is not allowed to ask it more than $m$ times for some fixed integer
number $1 \le m \le n$. This prediction game is the bandit label efficient
game.

Goal: The forecaster tries to maximize his cumulative gain $\sum_{t=1}^{n} g_{I_t,t}$.

Figure 4.7: The four prediction games considered in this section.

Theorem 22 The INF algorithm with iIj{x) = (%^)^ + ^7= satisfies 

ERn < llv^. 

4.3.2. Extensions to other sequential prediction games. Let us now
describe a more general setting, in which the feedback received by the forecaster
after drawing an arm differs from game to game. The four games are detailed in
Figure 4.7. As for the weighted average forecasters, the INF forecaster can be
adapted to the different games by simply modifying the estimates $\tilde{g}_{i,t}$ of $g_{i,t}$. The
resulting slightly modified INF forecaster is given in Figure 4.8. Interestingly, we
can provide a unified analysis of these games for the INF forecaster. It
essentially recovers the known minimax bounds, while sometimes improving the
best known upper bound by a logarithmic term. It also leads to high probability
bounds on the regret holding for any confidence level, which contrasts with
previously known results. Let us now detail the main results for the last three games of
Figure 4.7 and for the tracking the best expert scenario.



INF (Implicitly Normalized Forecaster):
Parameters:

• the continuously differentiable function $\psi : \mathbb{R}_-^* \to \mathbb{R}_+^*$ satisfying (4.3.1)

• the estimates $\tilde{g}_{i,t}$ of $g_{i,t}$ based on the (drawn arms and) observed rewards at
time $t$ (and before time $t$)

Let $p_1$ be the uniform distribution over $\{1, \dots, K\}$.
For each round $t = 1, 2, \dots$,

(1) Draw an arm $I_t$ from the probability distribution $p_t$.

(2) Use the (potentially) observed reward(s) to build the estimate $\tilde{g}_t =
(\tilde{g}_{1,t}, \dots, \tilde{g}_{K,t})$ of $(g_{1,t}, \dots, g_{K,t})$ and let: $\tilde{G}_t = \sum_{s=1}^{t} \tilde{g}_s = (\tilde{G}_{1,t}, \dots, \tilde{G}_{K,t})$.

(3) Compute the normalization constant $C_t = C(\tilde{G}_t)$.

(4) Compute the new probability distribution $p_{t+1} = (p_{1,t+1}, \dots, p_{K,t+1})$ where
$p_{i,t+1} = \psi(\tilde{G}_{i,t} - C_t)$.

Figure 4.8: The proposed policy for the four prediction games.



The label efficient game. This game was introduced by [66]: as explained in
Figure 4.7, the forecaster observes the reward vector only if he asks for it, and
he is not allowed to ask it more than $m$ times for some fixed integer number
$1 \le m \le n$. Following the work of Cesa-Bianchi, Lugosi and Stoltz [47], we
consider the following policy for requesting the reward vector. At each round, we
draw a Bernoulli random variable $Z_t$, with parameter $\delta = \frac{3m}{4n}$, to decide whether
we ask for the rewards or not. To fulfil the game requirement, we naturally do not
ask for the rewards if $\sum_{s=1}^{t-1} Z_s \ge m$.
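
The querying rule is easy to simulate. In the sketch below the function names are hypothetical, the parameter `delta` is left as an argument, and one simplification is made: once the budget of $m$ queries is used, the recorded decision is forced to 0, so the returned list contains the queries actually issued rather than the raw Bernoulli draws. The estimate $g Z_t/\delta$ is the importance-weighted form suggested by the rule, stated here as an assumption.

```python
import random


def label_efficient_queries(n, m, delta, rng):
    """Query decisions over n rounds with at most m queries (1 = ask for the rewards)."""
    asked = 0
    decisions = []
    for _ in range(n):
        z = 1 if rng.random() < delta else 0
        if asked >= m:
            z = 0                  # budget exhausted: never ask again
        asked += z
        decisions.append(z)
    return decisions


def estimated_gain(gain, asked, delta):
    """Importance-weighted estimate g * Z / delta of a reward seen with probability delta."""
    return gain * asked / delta
```

As long as the budget is not exhausted, the estimate is unbiased because a reward is observed with probability `delta` and reweighted by `1 / delta`.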



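Theorem 23 Let $\psi(x) = \exp\big(\frac{\sqrt{m\log K}}{n}x\big)$ and $\tilde{g}_{i,t} = \frac{g_{i,t}}{\delta}Z_t$ with $\delta = \frac{3m}{4n}$. Then,
for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, INF satisfies:

\[ R_n \le 2n\sqrt{\frac{\log K}{m}} + n\sqrt{\frac{27\log(2K\varepsilon^{-1})}{m}}, \]

hence

\[ \mathbb{E}R_n \le 8n\sqrt{\frac{\log(6K)}{m}}. \]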

This theorem is similar to Theorem 6.2 of [46]. The main difference and
novelty is that the policy does not depend on the confidence level, so the high
probability bound is valid for any confidence level for the same policy, and the
expected regret of this policy also has the minimax optimal rate, i.e. $n\sqrt{\log(K)/m}$.

High probability bounds for the bandit game. Here the main difference with
Section 4.3 is to use the biased estimates $\tilde g_{i,t} = \frac{g_{i,t}}{p_{i,t}} 1_{I_t = i} + \frac{\beta}{p_{i,t}}$ for some appropriate
small $\beta > 0$. It may appear surprising as it introduces a bias in the estimate of
$g_{i,t}$. However, this modification allows us to obtain high probability upper bounds with
the correct rate on the difference $\sum_{t=1}^n g_{i,t} - \sum_{t=1}^n \tilde g_{i,t}$. A second reason for this
modification (not needed in this particular section) is that it allows us to track the
best expert (see Section 4.3.2). For the sake of simplicity, the following theorem
concerns deterministic adversaries (which are defined by a fixed matrix of the nK
rewards).
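The size of the bias can be checked directly. The sketch below assumes the estimate reads $\tilde g_{i,t} = \frac{g_{i,t}}{p_{i,t}} 1_{I_t=i} + \frac{\beta}{p_{i,t}}$ and computes its exact expectation over the draw of the arm; the function names and the numerical values are illustrative only.

```python
from typing import List


def biased_estimates(g: List[float], p: List[float], arm: int, beta: float) -> List[float]:
    """Estimates for all arms when `arm` was played (only g[arm] is observed)."""
    return [(g[i] / p[i] if i == arm else 0.0) + beta / p[i] for i in range(len(p))]


def expected_estimates(g: List[float], p: List[float], beta: float) -> List[float]:
    """Exact expectation of the estimates when the arm is drawn from p."""
    K = len(p)
    expectation = [0.0] * K
    for arm in range(K):
        est = biased_estimates(g, p, arm, beta)
        for i in range(K):
            expectation[i] += p[arm] * est[i]
    return expectation


g, p, beta = [0.2, 0.9, 0.5], [0.5, 0.3, 0.2], 0.05
print(expected_estimates(g, p, beta))  # each entry equals g_i + beta / p_i
```

Under this reading, the estimate overshoots the true reward by $\beta / p_{i,t}$ in expectation, so rarely played arms receive the largest upward correction.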



Theorem 24 For a deterministic adversary, The INF algorithm with tplx) = 

i^^Y '^^^ ~ p^^^*=* ~'~ p \jnK ^^^^^fi^^' fa^ £^ > 0, with proba- 

bility at least 1 — e. 



$R_n \le 10\sqrt{nK} + 2\sqrt{nK}\,\log(\epsilon^{-1})$.

(Consequently, it also satisfies $\mathbb{E} R_n \le 12\sqrt{nK}$.)

The novelty of the result, which is similar to Theorem 6.10 of [46], is both
the absence of the log K factor and the fact that the high probability bound is valid for the
same policy at any confidence level.

Label efficient and bandit game (LE bandit). In this game, first considered by
György and Ottucsák [64] and which is a combination of the two previously seen
games, the forecaster observes the reward of the arm he selected only if he asks
for it, and he is not allowed to request it more than m times for some fixed integer
number $1 \le m \le n$. We consider a similar policy for requesting the reward
as in the label efficient game. At each round, we draw a Bernoulli random
variable $Z_t$, with parameter $\delta = \frac{3m}{4n}$, to decide whether we ask for the obtained
reward or not. To fulfil the game requirement, we do not ask for the reward if
$\sum_{s=1}^{t-1} Z_s \ge m$.

Theorem 25 Forij{x) = (3^)' + 7= and gi^t = 9i/-^T' the INF algo- 
rithm satisfies 

$\mathbb{E} R_n \le 40\, n \sqrt{\frac{K}{m}}$.

As for the bandit game, the use of the INF forecaster makes it possible to get rid of the
log K factor that appeared in previous works.

Tracking the best expert in the bandit game. In the previous sections, the cumu- 
lative gain of the forecaster was compared to the cumulative gain of the best sin- 
gle expert. Here, it will be compared to more flexible strategies that are allowed 
to switch actions. A switching strategy is described by a vector $(i_1, \dots, i_n) \in \{1, \dots, K\}^n$. Its size is defined by

$S(i_1, \dots, i_n) = \sum_{t=1}^{n-1} 1_{i_{t+1} \neq i_t}$,

and its cumulative gain is

$G_{(i_1, \dots, i_n)} = \sum_{t=1}^{n} g_{i_t,t}$.

The regret of a forecaster with respect to the best switching strategy with S
switches is then given by:

$R_n^S = \max_{(i_1,\dots,i_n):\ S(i_1,\dots,i_n) \le S} G_{(i_1,\dots,i_n)} - \sum_{t=1}^{n} g_{I_t,t}$.

As in Section 4.3.2, we use the estimates

$\tilde g_{i,t} = g_{i,t} \frac{1_{I_t=i}}{p_{i,t}} + \frac{\beta}{p_{i,t}}$,

and $0 < \beta \le 1$. The $\beta$ term, which, as already stated, introduces a bias in the
estimate of $g_{i,t}$, constrains the differences $\max_{i=1,\dots,K} \tilde G_{i,t} - \min_{j=1,\dots,K} \tilde G_{j,t}$ to be
relatively small. This is the key property in order to track the best switching strategy,
provided that the number of switches is not too large. We have the following
result for the INF forecaster using the above estimates and an exponential function
$\psi$ (recall that for exponential $\psi$, the INF forecaster reduces to the traditional
exponentially weighted forecasters).
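To make the comparison class concrete, the following sketch computes the size and cumulative gain of a switching strategy, and the gain of the best strategy with at most S switches by dynamic programming. It assumes the full reward matrix is known (which the forecaster does not have in the bandit game); `gains[t][i]` denotes the reward of arm i at round t, and all names are illustrative.

```python
from typing import List, Sequence


def num_switches(seq: Sequence[int]) -> int:
    """Size of a switching strategy: number of t with i_{t+1} != i_t."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def cumulative_gain(seq: Sequence[int], gains: List[List[float]]) -> float:
    """Cumulative gain of the strategy: sum over t of g_{i_t, t}."""
    return sum(gains[t][i] for t, i in enumerate(seq))


def best_switching_gain(gains: List[List[float]], S: int) -> float:
    """Best cumulative gain over strategies with at most S switches."""
    n, K = len(gains), len(gains[0])
    neg = float("-inf")
    # best[s][i]: best gain so far, ending at arm i, using at most s switches
    best = [[gains[0][i] for i in range(K)] for _ in range(S + 1)]
    for t in range(1, n):
        new = [[neg] * K for _ in range(S + 1)]
        for s in range(S + 1):
            for i in range(K):
                stay = best[s][i]
                switch = neg
                if s > 0:
                    switch = max((best[s - 1][j] for j in range(K) if j != i),
                                 default=neg)
                new[s][i] = max(stay, switch) + gains[t][i]
        best = new
    return max(best[S])


gains = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(num_switches((0, 1, 0)), cumulative_gain((0, 1, 0), gains))
print([best_switching_gain(gains, S) for S in range(3)])
```

With S = 0 the benchmark is the best single arm, so the usual regret is recovered as a special case.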
Theorem 26 Let $s = S \log\big(\frac{enK}{S}\big) + \log(2K)$ with $e = \exp(1)$ and the natural
convention $S \log(enK/S) = 0$ for $S = 0$. Consider $\psi(x) = \exp(\eta x) + \frac{\gamma}{K}$ with
Appendix A

Some basic properties of the Kullback-Leibler divergence

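The KL divergence between two distributions on some measurable space $\mathcal{G}$,

$K(\rho, \pi) = \mathbb{E}_{g\sim\rho} \log\big(\frac{d\rho}{d\pi}(g)\big)$ if $\rho \ll \pi$, and $+\infty$ otherwise, (A.1)

satisfies for $\rho \ll \pi$, $K(\rho,\pi) = \mathbb{E}_{g\sim\pi}\, \chi\big(\frac{d\rho}{d\pi}(g)\big)$, with $\chi$ the function defined on
$(0, +\infty)$ by $\chi(u) = u \log(u) + 1 - u$. Since the function $\chi$ is nonnegative and
equals zero only at 1, we have

$K(\rho,\pi) \ge 0$, (A.2)

and

$K(\rho,\pi) = 0 \iff \rho = \pi$. (A.3)

Let $h : \mathcal{G} \to \mathbb{R}$ s.t. $\mathbb{E}_{g\sim\pi} e^{h(g)} < +\infty$. Define

$\pi_h(dg) = \frac{e^{h(g)}}{\mathbb{E}_{g'\sim\pi} e^{h(g')}}\, \pi(dg)$.

By expanding the definition of the KL divergence $K(\rho, \pi_h)$, we get

$K(\rho, \pi_h) = K(\rho, \pi) - \mathbb{E}_{g\sim\rho} h(g) + \log \mathbb{E}_{g\sim\pi} e^{h(g)}$,

which implies from (A.2) and (A.3)

$\sup_{\rho} \{\mathbb{E}_{g\sim\rho} h(g) - K(\rho, \pi)\} = \log \mathbb{E}_{g\sim\pi} e^{h(g)}$, (A.4)

and

$\mathrm{argmax}_{\rho} \{\mathbb{E}_{g\sim\rho} h(g) - K(\rho, \pi)\} = \pi_h$. (A.5)

By differentiating, one may note that the function $\lambda \mapsto K(\pi_{\lambda h}, \pi)$ is nondecreasing
on $[0, +\infty)$. Finally, if $\mathcal{G}$ is finite and $\pi$ is the uniform distribution on $\mathcal{G}$, we
have

$K(\rho,\pi) = \log(|\mathcal{G}|) - H(\rho) \le \log(|\mathcal{G}|)$, (A.6)

where $H(\rho) = -\sum_{g\in\mathcal{G}} \rho(g) \log \rho(g)$ is the Shannon entropy of $\rho$.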
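The Legendre transform identity and the entropy bound can be checked numerically on a finite space. The sketch below assumes a finite set with a strictly positive prior and reads the identities as: the supremum of $\mathbb{E}_\rho h - K(\rho,\pi)$ equals $\log \mathbb{E}_\pi e^h$ and is attained at the Gibbs distribution $\pi_h$, and $K(\rho,\pi) \le \log|\mathcal{G}|$ for uniform $\pi$. The function names and the numbers are illustrative.

```python
import math
from typing import List


def kl(rho: List[float], pi: List[float]) -> float:
    """KL divergence between discrete distributions (pi strictly positive)."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)


def gibbs(pi: List[float], h: List[float]) -> List[float]:
    """Gibbs distribution pi_h, proportional to exp(h) * pi."""
    w = [p * math.exp(x) for p, x in zip(pi, h)]
    z = sum(w)
    return [x / z for x in w]


def objective(rho: List[float], pi: List[float], h: List[float]) -> float:
    """E_rho h - K(rho, pi), the quantity maximized in the Legendre transform."""
    return sum(r * x for r, x in zip(rho, h)) - kl(rho, pi)


pi = [0.25] * 4
h = [1.0, -0.5, 0.3, 2.0]
log_mgf = math.log(sum(p * math.exp(x) for p, x in zip(pi, h)))
print(objective(gibbs(pi, h), pi, h), log_mgf)    # the two values coincide
print(objective([0.7, 0.1, 0.1, 0.1], pi, h))     # any other rho gives less
```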
Appendix B
Proof of McAllester's PAC Bayesian bound



McAllester's bound (McA) states that with probability at least $1 - \epsilon$, for any
$\rho \in \mathcal{M}$, we have

$\mathbb{E}_{g\sim\rho} R(g) - \mathbb{E}_{g\sim\rho} r(g) \le \sqrt{\frac{K(\rho,\pi) + \log(2n) + \log(\epsilon^{-1})}{2n - 1}}$. (B.1)

Here is a short proof of this statement that essentially follows the one proposed by 
Seeger. 

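Let us first recall that a real-valued random variable V such that $\mathbb{E} e^{V} \le 1$
satisfies: for any $\epsilon > 0$, with probability at least $1 - \epsilon$, we have $V \le \log(\epsilon^{-1})$
(this is Markov's inequality applied to $e^V$). So to prove (B.1), we only need to check
that the random variable

$V = \sup_{\rho} \big\{(2n - 1) [\max(\mathbb{E}_{\rho(df)} R(f) - \mathbb{E}_{\rho(df)} r(f), 0)]^2 - K(\rho, \pi) - \log(2n)\big\}$

satisfies $\mathbb{E} e^{V} \le 1$.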

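From Jensen's inequality applied to the convex function $x \mapsto [\max(x, 0)]^2$
and the Legendre transform of the KL divergence (A.4), we have

$V \le \sup_{\rho} \big\{(2n - 1)\, \mathbb{E}_{\rho(df)} [\max(R(f) - r(f), 0)]^2 - K(\rho,\pi) - \log(2n)\big\}$
$\quad = -\log(2n) + \log \mathbb{E}_{\pi(df)} e^{(2n-1)[\max(R(f)-r(f),0)]^2}$,

hence

$\mathbb{E} e^{V} \le \frac{1}{2n}\, \mathbb{E}\, \mathbb{E}_{\pi(df)} e^{(2n-1)[\max(R(f)-r(f),0)]^2}$

$\quad = \frac{1}{2n}\, \mathbb{E}_{\pi(df)} \Big(1 + \mathbb{E}\big\{e^{(2n-1)[\max(R(f)-r(f),0)]^2} - 1\big\}\Big)$ from Fubini's theorem

$\quad \le \frac{1}{2n}\, \mathbb{E}_{\pi(df)} \Big(1 + \int_0^{+\infty} e^{-2n \frac{\log(t+1)}{2n-1}}\, dt\Big)$ from Hoeffding's inequality

$\quad = 1$,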

which ends the proof. 
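To get a feel for the orders of magnitude involved, the right-hand side of (B.1) can be evaluated directly. The sketch below assumes (B.1) reads as a square-root bound with numerator $K(\rho,\pi) + \log(2n) + \log(\epsilon^{-1})$ and denominator $2n - 1$; `mcallester_gap` is a hypothetical helper name and the inputs are illustrative.

```python
import math


def mcallester_gap(kl: float, n: int, eps: float) -> float:
    """Upper bound on E_rho R - E_rho r, under the stated reading of (B.1)."""
    return math.sqrt((kl + math.log(2 * n) + math.log(1.0 / eps)) / (2 * n - 1))


for n in (100, 1000, 10000):
    print(n, mcallester_gap(kl=2.0, n=n, eps=0.05))
```

The gap shrinks roughly like $1/\sqrt{n}$ and grows with the KL term, which is the usual trade-off between sample size and the complexity of the posterior.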
Appendix C
Proof of Seeger's PAC Bayesian bound



Here we sketch the proof of Seeger's bound (S), which states that with probability at least
$1 - \epsilon$, for any $\rho \in \mathcal{M}$, we have

$K\big(\mathbb{E}_{g\sim\rho} r(g)\, \|\, \mathbb{E}_{g\sim\rho} R(g)\big) \le \frac{K(\rho,\pi) + \log(2\sqrt{n}\,\epsilon^{-1})}{n}$, (C.1)

where $K(q\|p) = K(\mathrm{Be}(q), \mathrm{Be}(p))$ with $\mathrm{Be}(q)$ and $\mathrm{Be}(p)$ denoting the Bernoulli
distributions of parameter q and p. The proof follows the same line as the one of
(McA). We introduce

$V = \sup_{\rho} \big\{ n K\big(\mathbb{E}_{\rho(df)} r(f)\, \|\, \mathbb{E}_{\rho(df)} R(f)\big) - K(\rho,\pi) - \log(2\sqrt{n}) \big\}$,

and as in the previous proof, we only need to check that $\mathbb{E} e^{V} \le 1$. This is done
by using Jensen's inequality for the convex function $(q,p) \mapsto K(q\|p)$ and using
the Legendre transform of the KL divergence (A.4). We have

$\mathbb{E} e^{V} \le \mathbb{E}\, e^{\sup_{\rho} \{ n \mathbb{E}_{\rho(df)} K(r(f)\|R(f)) - K(\rho,\pi) - \log(2\sqrt{n}) \}} = \frac{1}{2\sqrt{n}}\, \mathbb{E}_{\pi(df)}\, \mathbb{E}\, e^{n K(r(f)\|R(f))} \le 1$,

where the last inequality is obtained from computations using Stirling's approximation.

The same procedure can be used to prove the other PAC-Bayesian bounds of
Chapter 2, Section 2.2. A similar way of approaching PAC-Bayesian theorems is
given in [60].
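In practice, a bound of the form (C.1) is turned into an explicit upper bound on the risk by inverting the binary KL divergence. The sketch below assumes (C.1) reads $K(\hat r \| R) \le (K(\rho,\pi) + \log(2\sqrt{n}/\epsilon))/n$ and inverts it by bisection; the helper names and the numerical inputs are illustrative, not from the text.

```python
import math


def binary_kl(q: float, p: float) -> float:
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    tiny = 1e-12
    p = min(max(p, tiny), 1.0 - tiny)   # avoid log(0) at the boundary
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def kl_upper_inverse(q: float, c: float, iters: int = 100) -> float:
    """Largest p in [q, 1] (up to bisection accuracy) with binary_kl(q, p) <= c."""
    lo, hi = q, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) > c:
            hi = mid
        else:
            lo = mid
    return lo


def seeger_bound(emp_risk: float, kl: float, n: int, eps: float) -> float:
    """Upper bound on the expected risk, under the stated reading of (C.1)."""
    c = (kl + math.log(2.0 * math.sqrt(n) / eps)) / n
    return kl_upper_inverse(emp_risk, c)


print(seeger_bound(emp_risk=0.1, kl=5.0, n=1000, eps=0.05))
```

The bisection relies on the fact that $p \mapsto K(q\|p)$ is increasing on $[q, 1]$.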
Appendix D

Proof of the learning rate of the progressive mixture rule



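Here is the proof, in a concise form and under the boundedness assumptions of Theorem 6,
that the expected excess risk of the progressive mixture rule is upper
bounded by $\frac{\log|\mathcal{G}|}{\lambda(n+1)}$ for $\lambda \le \frac{1}{8}$. The condition on $\lambda$ guarantees that for any
$y \in [-1, 1]$, the function $y' \mapsto e^{-\lambda(y - y')^2}$ is concave on $[-1,1]$. Thus we can
write

$\mathbb{E} R(\hat g_{\mathrm{pm}}) \le \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E} R\big(\mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}} g\big)$ (D.1)

$\quad = \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{Z_1^{i+1}} \big[Y_{i+1} - \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}} g(X_{i+1})\big]^2$ (D.2)

$\quad = \frac{1}{n+1} \mathbb{E}_{Z_1^{n+1}} \sum_{i=0}^{n} \big[Y_{i+1} - \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}} g(X_{i+1})\big]^2$ (D.3)

$\quad \le \frac{1}{n+1} \mathbb{E}_{Z_1^{n+1}} \sum_{i=0}^{n} -\frac{1}{\lambda} \log \mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}} e^{-\lambda[Y_{i+1} - g(X_{i+1})]^2}$ (D.4)

$\quad = -\frac{1}{\lambda(n+1)} \mathbb{E}_{Z_1^{n+1}} \log \mathbb{E}_{g\sim\pi} e^{-\lambda\Sigma_{n+1}(g)}$ (D.5)

$\quad \le \min_{g\in\mathcal{G}} R(g) + \frac{\log|\mathcal{G}|}{\lambda(n+1)}$,

where (D.1) comes from Jensen's inequality on the convex function $y' \mapsto (y - y')^2$,
(D.2) uses that the distribution $\pi_{-\lambda\Sigma_i}$ depends only on $Z_1, \dots, Z_i$, (D.4) comes from
Jensen's inequality on the concave function $y' \mapsto e^{-\lambda(y-y')^2}$, and (D.5) is the core
of the proof and explains why PM is based on a Cesàro mean. The steps (D.2) and
(D.3) are exactly the two steps of the proof of Lemma 7. Note that this analysis
gives a result similar to the one in Theorem 6 except that the factor 2 is replaced
by $\frac{1}{\lambda} \ge 8$. For the progressive indirect mixture rule, $\mathbb{E}_{g\sim\pi_{-\lambda\Sigma_i}} g$ is replaced by $\hat h_i$,
and the step (D.4) is still valid from the very definition (3.2.1) of $\hat h_i$.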
Appendix E
The empirical Bernstein's inequality



The goal of the empirical Bernstein's inequality is to provide confidence bounds
on the expectation of a distribution with bounded support, say in [0, 1], given
a sample from it. Let $U, U_1, U_2, \dots$ be independent and identically distributed
random variables taking their values in [0, 1]. Let

$\bar U_t = \frac{1}{t} \sum_{i=1}^{t} U_i$ and $V_t = \frac{1}{t} \sum_{i=1}^{t} (U_i - \bar U_t)^2$.



Here we prove the empirical Bernstein's inequality (Lemma 21), which
states that for any $\epsilon > 0$, with probability at least $1 - 2\epsilon$, for any $t \in \{1, \dots, n\}$
and $\ell_t = \frac{n \log(\epsilon^{-1})}{t^2}$, we have

[/i-E[/<min^i/2^t(^+^ + £i(^ + V^r^3^) , j • (E.l) 

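Proof. Let $\Lambda(\lambda) = \log \mathbb{E} e^{\lambda(U - \mathbb{E}U)}$ be the log-Laplace transform of the random
variable $U - \mathbb{E}U$. Let $S_t = \sum_{i=1}^{t} (U_i - \mathbb{E}U_i)$ with the convention $S_0 = 0$. From
Inequality (2.17) of [67], we have¹

$\mathbb{P}\big(\max_{1\le t\le n} S_t > s\big) \le \inf_{\lambda > 0} e^{-\lambda s + n\Lambda(\lambda)}$.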



Let $V = \mathrm{Var}\, U$. Hoeffding's inequality and Bennett's inequality imply

$\Lambda(\lambda) \le \min\Big(\frac{\lambda^2}{8},\ (e^{\lambda} - 1 - \lambda) V\Big)$,



which by standard computations (see, e.g.. Inequality (45) of IfTSl ) gives that for 
any e > 0, with probability at least 1 — e. 



$\max_{1\le t\le n} S_t \le \min\Big(\sqrt{\frac{n \log(\epsilon^{-1})}{2}},\ \sqrt{2nV \log(\epsilon^{-1})} + \frac{\log(\epsilon^{-1})}{3}\Big)$. (E.2)



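¹ This comes from a martingale argument due to Doob. For any $\lambda > 0$, the sequence
$(e^{\lambda S_t - t\Lambda(\lambda)})_{t\ge 0}$ is a martingale with respect to the filtration $(\sigma(U_1, \dots, U_t))_{t\ge 0}$ since
$\mathbb{E}(e^{\lambda S_t - t\Lambda(\lambda)} | U_1, \dots, U_{t-1}) = e^{\lambda S_{t-1} - (t-1)\Lambda(\lambda)}$. Introduce the stopping time
$T = \min(n + 1, \min\{t \in \mathbb{N} : S_t > s\})$. From the optional stopping theorem, for any $\lambda > 0$, we have

$1 = \mathbb{E} e^{\lambda S_T - T\Lambda(\lambda)} \ge \mathbb{P}(T \le n)\, e^{\lambda s - n\Lambda(\lambda)}$,

hence

$\mathbb{P}\big(\max_{1\le t\le n} S_t > s\big) = \mathbb{P}(T \le n) \le \inf_{\lambda > 0} e^{-\lambda s + n\Lambda(\lambda)}$.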
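Let $W = (U - \mathbb{E}U)^2$ and $W_i = (U_i - \mathbb{E}U_i)^2$ for $i \ge 1$. Let $S'_t = \sum_{i=1}^{t} (\mathbb{E}W_i - W_i)$
and $\Lambda'(\lambda) = \log \mathbb{E} e^{\lambda(-W + \mathbb{E}W)}$. As above, from Inequality (2.17) of [67], we have

$\mathbb{P}\big(\max_{1\le t\le n} S'_t > s\big) \le \inf_{\lambda > 0} e^{-\lambda s + n\Lambda'(\lambda)}$.

Now using that $e^{-u} \le 1 - u + \frac{u^2}{2}$ for $u \ge 0$ and $\log(1 + u) \le u$ for $u > -1$,
we have $\log \mathbb{E} e^{-\lambda W} \le \log \mathbb{E}\big(1 - \lambda W + \frac{\lambda^2 W^2}{2}\big) \le -\lambda \mathbb{E}W + \frac{\lambda^2}{2} \mathbb{E}(W^2)$, hence
$\Lambda'(\lambda) \le \frac{\lambda^2}{2} \mathbb{E}(W^2)$. Optimizing with respect to $\lambda$ gives that for any $\epsilon > 0$, with
probability at least $1 - \epsilon$,

$\max_{1\le t\le n} S'_t \le \sqrt{2n\, \mathbb{E}(W^2) \log(\epsilon^{-1})}$. (E.3)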

l<t<n 

Now we use the following lemma to bound $\mathbb{E}(W^2)$.

Lemma 27 A random variable U taking its values in [0, 1] satisfies

$\mathbb{E}[(U - \mathbb{E}U)^4] \le V(1 - 3V)$, (E.4)

where $V = \mathbb{E}[(U - \mathbb{E}U)^2]$ is the variance of U. If U admits a Bernoulli distribution,
one can put an equality in (E.4).

Proof. We have 

E[{U - EUf] - V{1 - W) = E{[U^-U + E{U)][U - E{U)]) 

+ 3([E(f/2)]2 -E([/)E([/^)). 

From Chebyshev's association inequality (also referred to as the Fortuin-Kasteleyn-Ginibre
inequality), both terms in the right-hand side are nonpositive. An alternative
proof consists in expanding the terms in Lemma 8 of [101] and noticing that
this exactly gives (E.4). The result for Bernoulli distributions comes from direct
computations. □
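Lemma 27 is easy to sanity-check numerically on discrete distributions. The sketch below assumes the lemma reads $\mathbb{E}[(U - \mathbb{E}U)^4] \le V(1 - 3V)$ for U supported in [0, 1], with equality for Bernoulli distributions; the helper names are illustrative.

```python
import random
from typing import List


def central_moment(values: List[float], probs: List[float], k: int) -> float:
    """k-th central moment of a discrete distribution."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mean) ** k for v, p in zip(values, probs))


def lemma27_gap(values: List[float], probs: List[float]) -> float:
    """V(1 - 3V) - E[(U - EU)^4]; nonnegative if the inequality holds."""
    V = central_moment(values, probs, 2)
    return V * (1.0 - 3.0 * V) - central_moment(values, probs, 4)


print(lemma27_gap([0.0, 1.0], [0.7, 0.3]))        # Bernoulli(0.3): zero up to rounding

rng = random.Random(0)
vals = [rng.random() for _ in range(4)]           # support points in [0, 1]
weights = [rng.random() + 1e-9 for _ in range(4)]
probs = [w / sum(weights) for w in weights]
print(lemma27_gap(vals, probs))                   # nonnegative on this example
```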

Combining the above lemma with (E.3), we get that with probability at least
$1 - \epsilon$,

$\max_{1\le t\le n} S'_t \le \sqrt{2nV(1 - 3V) \log(\epsilon^{-1})}$. (E.5)

We now work on the event $\mathcal{E}$ of probability at least $1 - 2\epsilon$ on which both (E.5)
and (E.2) hold. The variance decomposition gives $V_t = \frac{1}{t} \sum_{i=1}^{t} (U_i - \bar U_t)^2 =
-(\mathbb{E}U - \bar U_t)^2 + \frac{1}{t} \sum_{i=1}^{t} W_i$, hence $S'_t = t(V - V_t) - t(\mathbb{E}U - \bar U_t)^2$. For any
$1 \le t \le n$, we have

$\bar U_t - \mathbb{E}U \le \min\Big(\sqrt{2V\ell_t} + \frac{\ell_t}{3},\ \sqrt{\frac{\ell_t}{2}}\Big)$, (E.6)

and

$V - V_t \le \sqrt{2V(1 - 3V)\ell_t} + (\bar U_t - \mathbb{E}U)^2$. (E.7)

If tJt < MJ, then dOl) is trivial. If Vt > V, (IeH) is a direct consequence of (lR6l) 
(since | — 3\4 > | — f > |)- Therefore, from now and on, we consider Ut > Ef/ 
and Vt < V. Then (lEB implies (f/* - Ef/)^ < 1^/2, and ^J} leads to 

> ^ - ^2^1 - mh - m ^ (w - ySii^^ ^ - 

hence 

By plugging this inequality into (E.6), we get (E.1). For the two-sided inequality
(4.2.12), one just needs to add the same inequality as (E.6) for $-U$. At the end,
three maximal inequalities are used (corresponding to $U_i$, $-U_i$ and $-W_i$), so that
the result holding with probability at least $1 - \epsilon$ contains $\log(3\epsilon^{-1})$ terms. □
Appendix F

On Exploration-Exploitation with Exponential weights (EXP3)

F.1. The variants of EXP3



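Parameters: $\eta \in (0, 1/K]$ and $\gamma \in [0, 1]$.

Let $p_1$ be the uniform distribution over $\{1, \dots, K\}$.

For each round $t = 1, 2, \dots$,

(1) Draw an arm $I_t$ according to the probability distribution $p_t$.

(2) Compute the estimated gain for each arm:

$\tilde g_{i,t} = \frac{g_{i,t}}{p_{i,t}} 1_{I_t=i}$ for the reward-magnifying version of EXP3,
$\tilde g_{i,t} = 1 - \frac{1 - g_{i,t}}{p_{i,t}} 1_{I_t=i}$ for the loss-magnifying version of EXP3,
$\tilde g_{i,t} = \frac{g_{i,t}}{p_{i,t}} 1_{I_t=i} + \frac{\beta}{p_{i,t}}$ for the tracking version of EXP3,
$\tilde g_{i,t} = -\frac{1}{\beta} \log\big(1 - \frac{\beta g_{i,t}}{p_{i,t}}\big) 1_{I_t=i}$ for the tightly biased version of EXP3,

and update the estimated cumulative gain: $\tilde G_{i,t} = \sum_{s=1}^{t} \tilde g_{i,s}$.

(3) Compute the new probability distribution over the arms:

$p_{t+1} = \gamma p_1 + (1 - \gamma) q_{t+1}$,

with

$q_{i,t+1} = \frac{\exp(\eta \tilde G_{i,t})}{\sum_{k=1}^{K} \exp(\eta \tilde G_{k,t})}$.

Figure F.1: EXP3 (Exploration-Exploitation with Exponential weights) for the
adversarial bandit problem.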
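The loop of Figure F.1 can be sketched in a few lines for the reward-magnifying variant. This is a minimal illustration, assuming the estimate is $g_{i,t}/p_{i,t}$ for the played arm and 0 otherwise, and that $q_{t+1}$ is the exponential-weights distribution on the estimated cumulative gains; the function name, the reward matrix and the parameter values are illustrative.

```python
import math
import random
from typing import List, Tuple


def exp3_reward_magnifying(gains: List[List[float]], eta: float, gamma: float,
                           rng: random.Random) -> Tuple[float, List[List[float]]]:
    """Run EXP3 on a fixed reward matrix gains[t][i]; return (total gain, p history)."""
    K = len(gains[0])
    G_hat = [0.0] * K                     # estimated cumulative gains
    p = [1.0 / K] * K                     # p_1 is uniform
    total, history = 0.0, []
    for row in gains:
        arm = rng.choices(range(K), weights=p)[0]   # (1) draw I_t from p_t
        g = row[arm]                                # only this reward is observed
        total += g
        G_hat[arm] += g / p[arm]                    # (2) importance-weighted estimate
        mx = max(G_hat)                             # shift for numerical stability
        w = [math.exp(eta * (x - mx)) for x in G_hat]
        z = sum(w)
        p = [gamma / K + (1.0 - gamma) * wi / z for wi in w]  # (3) mix with uniform
        history.append(p)
    return total, history


gains = [[1.0, 0.0, 0.0] for _ in range(200)]       # arm 0 is always the best
total, hist = exp3_reward_magnifying(gains, eta=0.1, gamma=0.1, rng=random.Random(0))
print(total, hist[-1])
```

Subtracting the maximum before exponentiating does not change $q_{t+1}$ and avoids overflow when the estimated gains grow.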

There are several variants of EXP3. They differ by the way $g_{i,t}$ is estimated,
as shown in Figure F.1. For deterministic adversaries, the loss-magnifying version
of EXP3 has the advantage of providing the best known constant in front of
the $\sqrt{nK \log K}$ term, that is $\sqrt{2}$ (note that our work succeeds in removing the
log K term but at the price of a larger numerical constant factor). For deterministic
adversaries, the reward-magnifying version of EXP3 (which is the one in the
seminal paper of Auer, Cesa-Bianchi, Freund and Schapire [19] for $\gamma = K\eta$) has
the advantage that the factor n in $\sqrt{nK \log K}$ can be replaced by $\max_{i=1,\dots,K} G_{i,n}$,
where $G_{i,n} = \sum_{t=1}^{n} g_{i,t}$. The tracking version of EXP3 is the one proposed in
Section 6.8 of [46] (and the one presented in Section 4.3.2). It (slightly) overestimates
the rewards since we have $\mathbb{E}_{I_t\sim p_t} \tilde g_{i,t} = g_{i,t} + \frac{\beta}{p_{i,t}}$. This idea was introduced
in [11] for tracking the best expert. In [12], we have introduced the tightly biased
version of EXP3 to achieve regret bounds depending on the performance of the
optimal arm. Contrary to the reward-magnifying version of EXP3, these bounds
hold for any adversary and high probability regret bounds are also obtained.



F.2. Proof of the learning rate of the reward-magnifying EXP3

Here we give an analysis of the reward-magnifying EXP3 (defined in Figure F.1),
which is an improvement (in terms of constant only) of the one in [19,
Section 3].

Theorem 28 Let $G_{\max} = \max_{i=1,\dots,K} G_{i,n}$. For deterministic adversaries, if
$4\eta K \le 5\gamma$, the expected regret of the reward-magnifying EXP3 satisfies

$\mathbb{E} R_n \le \frac{\log K}{\eta} + \gamma G_{\max}$.

In particular, if $\eta = \sqrt{\frac{5 \log K}{4nK}}$ and $\gamma = \min\Big(\sqrt{\frac{4K \log K}{5n}},\ 1\Big)$, we have

$\mathbb{E} R_n \le \sqrt{\frac{16}{5} nK \log(K)}$.

Proof. The condition ArjK < 87 is put to guarantee that ^{^) < 
where : m h-> ''""g"" is an increasing function. For any adversary, we have 

n n 

t=i t=i 



1-7 

V 

1-7 

V 



n , 

^ ( logEi^^^e"^'-* - log [e"^^'=~'"^'=^*Ei^g,e''^"'*] 

(5-X^log(A)), 

^ t=i ' 



where 
and

= e " II + r] ^— + ^(^— jr/ Ei^g^g^^^ \ 

V 1-7 1-7 1-7 / 

(F.2) 

< 1. 

„ ^ ,„ ^2 _ 9h,t ~ 

For (IF.3I) . we used ?7i^\l/ (2^) < 7. We have thus proved 

1-7 

, 9iut ^ 

t=i 

For a deterministic adversary, we have EGj „ = EGj „ = Gj^n, so that 

" 1 — 

> ^ElogE.^p.e''^^'" 

> i^logEi^p^e"^^'-" (F.5) 

logE.^p.e''^- > _ (in^li^ + (1 - 7) max 

7^ j=i,...,_ft' 

where Inequality (F.5), which moves the expectation sign inside the exponential,
can be viewed as an infinite dimensional Jensen's inequality (see Lemma 3.2 of
[9]). For a deterministic adversary, we have proved

$\mathbb{E} R_n = \max_{i=1,\dots,K} G_{i,n} - \mathbb{E} \sum_{t=1}^{n} g_{I_t,t} \le \frac{\log K}{\eta} + \gamma \max_{i=1,\dots,K} G_{i,n}$,



To get (IF. II) . we used that is an increasing function and that rjgi^t < ^ < 
For (EH), we used (1 - 7)^.^,,^^^^ < E.^^.^^^ = |l£ < ^^^^ ^^^^ = KE,.p,^,,i. 

7 

?! 

V ^7/„t > log E,.p, e"*^- (F.4) 



t=l 



V 




1 - 


7 


V 




1 - 


7 


V 



hence the first claimed result.

The second result is trivial when $\sqrt{4K \log K / (5n)} > 1$ since the upper bound
is then larger than n. Otherwise, we have $\gamma = \sqrt{4K \log K / (5n)} < 1$ and $4\eta K = 5\gamma$,
so that the result follows from the first one. □
Appendix G

Experimental results for the min-max truncated estimator defined in Section 3.3.3

In Section G.1, we detail the different kinds of noise we work with. Then, Sections
G.2, G.3 and G.4 describe the three types of functional relationships between
the input, the output and the noise involved in our experiments. A motivation for
choosing these input-output distributions was the ability to compute the excess risk
exactly, and thus to compare estimators easily. Section G.5 presents the
experimental results.

G.1. Noise distributions

In our experiments, we consider different types of noise that are centered and
with unit variance:

• the standard Gaussian noise: $W \sim \mathcal{N}(0, 1)$,

• a heavy-tailed noise defined by: $W = \mathrm{sign}(V)/|V|^{1/q}$, with $V \sim \mathcal{N}(0, 1)$ a
standard Gaussian random variable and $q = 2.01$ (the real number q is taken
strictly larger than 2 as for $q = 2$, the random variable W would not admit
a finite second moment),
• an asymmetric heavy-tailed noise defined by: 

ifV>0, 
— 2-7 otherwise, 

with $q = 2.01$ and $V \sim \mathcal{N}(0, 1)$ a standard Gaussian random variable,

• a mixture of a Dirac random variable with a low-variance Gaussian random
variable defined by: with probability p, $W = \sqrt{(1 - \rho)/p}$, and with
probability $1 - p$, W is drawn from

$\mathcal{N}\Big(-\frac{\sqrt{p(1 - \rho)}}{1 - p},\ \frac{\rho}{1 - p} - \frac{p(1 - \rho)}{(1 - p)^2}\Big)$.

The parameter $\rho \in [p, 1]$ characterizes the part of the variance of W explained
by the Gaussian part of the mixture. Note that this noise admits
exponential moments, but for n of order $1/p$, the Dirac part of the mixture
generates low signal to noise points.
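The mixture noise can be checked to be centered with unit variance, and both it and the symmetric heavy-tailed noise are easy to sample. The sketch below assumes the mixture parameters read as: Dirac mass at $\sqrt{(1-\rho)/p}$ with probability p, and otherwise a Gaussian with mean $-\sqrt{p(1-\rho)}/(1-p)$ and variance $\rho/(1-p) - p(1-\rho)/(1-p)^2$; function names are illustrative.

```python
import math
import random
from typing import Tuple


def mixture_params(p: float, rho: float) -> Tuple[float, float, float]:
    """Return (Dirac location, Gaussian mean, Gaussian variance)."""
    dirac = math.sqrt((1.0 - rho) / p)
    mean = -math.sqrt(p * (1.0 - rho)) / (1.0 - p)
    var = rho / (1.0 - p) - p * (1.0 - rho) / (1.0 - p) ** 2
    return dirac, mean, var


def sample_mixture(p: float, rho: float, rng: random.Random) -> float:
    """Draw one sample of the Dirac/Gaussian mixture noise."""
    dirac, mean, var = mixture_params(p, rho)
    if rng.random() < p:
        return dirac
    return rng.gauss(mean, math.sqrt(var))


def sample_heavy_tailed(q: float, rng: random.Random) -> float:
    """Draw W = sign(V) / |V|**(1/q) with V standard Gaussian."""
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:                       # probability-zero event, guarded anyway
        v = rng.gauss(0.0, 1.0)
    return math.copysign(1.0, v) / abs(v) ** (1.0 / q)


p, rho = 0.005, 0.1
dirac, mean, var = mixture_params(p, rho)
first_moment = p * dirac + (1.0 - p) * mean
second_moment = p * dirac ** 2 + (1.0 - p) * (var + mean ** 2)
print(first_moment, second_moment)        # close to 0 and 1
```

The Gaussian variance is nonnegative exactly when $\rho \ge p$, which matches the stated range of $\rho$.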
G.2. Independent normalized covariates (INC(n, d))

In INC(n, d), the input-output pair is such that

$Y = \langle \theta^*, X \rangle + \sigma W$,

where the components of X are independent standard normal distributions, $\theta^* =
(10, \dots, 10)^T \in \mathbb{R}^d$ and $\sigma = 10$.

G.3. Highly correlated covariates (HCC(n, d))

In HCC(n, d), the input-output pair is such that

$Y = \langle \theta^*, X \rangle + \sigma W$,

where X is a multivariate centered normal Gaussian with covariance matrix Q
obtained by drawing a (d, d)-matrix A of uniform random variables in [0, 1] and
by computing $Q = AA^T$, $\theta^* = (10, \dots, 10)^T \in \mathbb{R}^d$, and $\sigma = 10$. So the only
difference with the setting of Section G.2 is the correlation between the covariates.

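G.4. Trigonometric series (TS(n, d))

Let X be a uniform random variable on [0, 1]. Let d be an even number. Let

$\varphi(X) = \big(\cos(2\pi X), \dots, \cos(d\pi X), \sin(2\pi X), \dots, \sin(d\pi X)\big)^T$.

In TS(n, d), the input-output pair is such that

$Y = 20X^2 - 10X - \frac{5}{3} + \sigma W$,

with $\sigma = 10$. One can check that this implies

$\theta^* = \Big(\frac{20}{\pi^2}, \dots, \frac{20}{\pi^2 (d/2)^2}, -\frac{10}{\pi}, \dots, -\frac{10}{\pi (d/2)}\Big)^T \in \mathbb{R}^d$.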
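The coefficients of $\theta^*$ can be verified by numerical integration. The sketch below assumes the regression function is $20x^2 - 10x - 5/3$ on [0, 1] and that the k-th cosine and sine coefficients are $2\int_0^1 f(x)\cos(2\pi k x)\,dx$ and $2\int_0^1 f(x)\sin(2\pi k x)\,dx$ (the features being orthogonal under the uniform distribution); it uses a plain midpoint rule, and the names are illustrative.

```python
import math
from typing import Callable


def f(x: float) -> float:
    """Regression function of the TS setting (noise-free part of Y)."""
    return 20.0 * x * x - 10.0 * x - 5.0 / 3.0


def coeff(basis: Callable[[float], float], k: int, steps: int = 20000) -> float:
    """Midpoint-rule approximation of 2 * integral_0^1 f(x) basis(2 pi k x) dx."""
    h = 1.0 / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * h
        total += f(x) * basis(2.0 * math.pi * k * x)
    return 2.0 * h * total


for k in (1, 2, 3):
    print(k, coeff(math.cos, k), 20.0 / (math.pi ** 2 * k ** 2),
          coeff(math.sin, k), -10.0 / (math.pi * k))
```

The constant $-5/3$ makes the regression function centered, so no intercept feature is needed.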

G.5. Results 

Tables G.1 and G.2 give the results for the mixture noise. Tables G.3, G.4 and
G.5 provide the results for the heavy-tailed noise and the standard Gaussian noise.
Each line of the tables has been obtained after 1000 generations of the training
set. These results show that the min-max truncated estimator $\hat f$ is often equal to
the ordinary least squares estimator $\hat f^{(ols)}$, while it ensures impressive consistent
improvements when it differs from $\hat f^{(ols)}$. In this latter case, the number of points
that are not considered in $\hat f$, i.e. the number of points with low signal to noise
ratio, varies a lot from 1 to 150 and is often of order 30. Note that not only the
points that we expect to be considered as outliers (i.e. very large output points)
are erased, and that these points seem to be taken out by local groups: see Figures
G.1 and G.2 in which the erased points are marked by surrounding circles.

Besides, the heavier the noise tail is (and also the larger the variance of the 
noise is), the more often the truncation modifies the initial ordinary least squares 
estimator, and the more improvements we get from the min-max truncated es- 
timator, which also becomes much more robust than the ordinary least squares 
estimator (see the confidence intervals in the tables). 



Table G.1: Comparison of the min-max truncated estimator $\hat f$ with the ordinary
least squares estimator $\hat f^{(ols)}$ for the mixture noise (see Section G.1) with $\rho = 0.1$
and $p = 0.005$. In parentheses, the 95%-confidence intervals for the estimated
quantities.

Columns: (1) nb of iterations; (2) nb of iter. with $R(\hat f) \neq R(\hat f^{(ols)})$; (3) nb of iter. with $R(\hat f) < R(\hat f^{(ols)})$; (4) $\mathbb{E} R(\hat f^{(ols)}) - R(f^*)$; (5) $\mathbb{E} R(\hat f) - R(f^*)$; (6) $\mathbb{E}[R(\hat f^{(ols)}) \,|\, \hat f \neq \hat f^{(ols)}] - R(f^*)$; (7) $\mathbb{E}[R(\hat f) \,|\, \hat f \neq \hat f^{(ols)}] - R(f^*)$.

Setting            | (1)  | (2) | (3) | (4)           | (5)           | (6)           | (7)
INC(n=200,d=1)     | 1000 | 419 | 405 | 0.567(±0.083) | 0.178(±0.025) | 1.191(±0.178) | 0.262(±0.052)
INC(n=200,d=2)     | 1000 | 506 | 498 | 1.055(±0.112) | 0.271(±0.030) | 1.884(±0.193) | 0.334(±0.050)
HCC(n=200,d=2)     | 1000 | 502 | 494 | 1.045(±0.103) | 0.267(±0.024) | 1.866(±0.174) | 0.316(±0.032)
TS(n=200,d=2)      | 1000 | 561 | 554 | 1.069(±0.089) | 0.310(±0.027) | 1.720(±0.132) | 0.367(±0.036)
INC(n=1000,d=2)    | 1000 | 402 | 392 | 0.204(±0.015) | 0.109(±0.008) | 0.316(±0.029) | 0.081(±0.011)
INC(n=1000,d=10)   | 1000 | 950 | 946 | 1.030(±0.041) | 0.228(±0.016) | 1.051(±0.042) | 0.207(±0.014)
HCC(n=1000,d=10)   | 1000 | 942 | 942 | 0.980(±0.038) | 0.222(±0.015) | 1.008(±0.039) | 0.203(±0.015)
TS(n=1000,d=10)    | 1000 | 976 | 973 | 1.009(±0.037) | 0.228(±0.017) | 1.018(±0.038) | 0.217(±0.016)
INC(n=2000,d=2)    | 1000 | 209 | 207 | 0.104(±0.007) | 0.078(±0.005) | 0.206(±0.021) | 0.082(±0.012)
HCC(n=2000,d=2)    | 1000 | 184 | 183 | 0.099(±0.007) | 0.076(±0.005) | 0.196(±0.023) | 0.070(±0.010)
TS(n=2000,d=2)     | 1000 | 172 | 171 | 0.101(±0.007) | 0.080(±0.005) | 0.206(±0.020) | 0.083(±0.012)
INC(n=2000,d=10)   | 1000 | 669 | 669 | 0.510(±0.018) | 0.206(±0.012) | 0.572(±0.023) | 0.117(±0.009)
HCC(n=2000,d=10)   | 1000 | 669 | 669 | 0.499(±0.018) | 0.207(±0.013) | 0.561(±0.023) | 0.125(±0.011)
TS(n=2000,d=10)    | 1000 | 754 | 753 | 0.516(±0.018) | 0.195(±0.013) | 0.558(±0.022) | 0.131(±0.011)
Table G.2: Comparison of the min-max truncated estimator $\hat f$ with the ordinary
least squares estimator $\hat f^{(ols)}$ for the mixture noise (see Section G.1) with $\rho = 0.4$
and $p = 0.005$. In parentheses, the 95%-confidence intervals for the estimated
quantities.

Columns: (1) nb of iterations; (2) nb of iter. with $R(\hat f) \neq R(\hat f^{(ols)})$; (3) nb of iter. with $R(\hat f) < R(\hat f^{(ols)})$; (4) $\mathbb{E} R(\hat f^{(ols)}) - R(f^*)$; (5) $\mathbb{E} R(\hat f) - R(f^*)$; (6) $\mathbb{E}[R(\hat f^{(ols)}) \,|\, \hat f \neq \hat f^{(ols)}] - R(f^*)$; (7) $\mathbb{E}[R(\hat f) \,|\, \hat f \neq \hat f^{(ols)}] - R(f^*)$.

Setting            | (1)  | (2) | (3) | (4)           | (5)           | (6)           | (7)
INC(n=200,d=1)     | 1000 | 234 | 211 | 0.551(±0.063) | 0.409(±0.042) | 1.211(±0.210) | 0.606(±0.110)
INC(n=200,d=2)     | 1000 | 195 | 186 | 1.046(±0.088) | 0.788(±0.061) | 2.174(±0.293) | 0.848(±0.118)
HCC(n=200,d=2)     | 1000 | 222 | 215 | 1.028(±0.079) | 0.748(±0.051) | 2.157(±0.243) | 0.897(±0.112)
TS(n=200,d=2)      | 1000 | 291 | 268 | 1.053(±0.079) | 0.805(±0.058) | 1.701(±0.186) | 0.851(±0.093)
INC(n=1000,d=2)    | 1000 | 127 | 117 | 0.201(±0.013) | 0.181(±0.012) | 0.366(±0.053) | 0.207(±0.035)
INC(n=1000,d=10)   | 1000 | 262 | 249 | 1.023(±0.035) | 0.902(±0.030) | 1.238(±0.081) | 0.777(±0.054)
HCC(n=1000,d=10)   | 1000 | 201 | 192 | 0.991(±0.033) | 0.902(±0.031) | 1.235(±0.088) | 0.790(±0.067)
TS(n=1000,d=10)    | 1000 | 171 | 162 | 1.009(±0.033) | 0.951(±0.031) | 1.166(±0.098) | 0.825(±0.071)
INC(n=2000,d=2)    | 1000 | 80  | 77  | 0.105(±0.007) | 0.099(±0.006) | 0.214(±0.042) | 0.135(±0.029)
HCC(n=2000,d=2)    | 1000 | 44  | 42  | 0.102(±0.007) | 0.099(±0.007) | 0.187(±0.050) | 0.120(±0.034)
TS(n=2000,d=2)     | 1000 | 47  | 47  | 0.101(±0.007) | 0.099(±0.007) | 0.147(±0.032) | 0.103(±0.026)
INC(n=2000,d=10)   | 1000 | 116 | 113 | 0.511(±0.016) | 0.491(±0.016) | 0.611(±0.052) | 0.437(±0.042)
HCC(n=2000,d=10)   | 1000 | 110 | 105 | 0.500(±0.016) | 0.481(±0.015) | 0.602(±0.056) | 0.430(±0.044)
TS(n=2000,d=10)    | 1000 | 101 | 98  | 0.511(±0.016) | 0.499(±0.016) | 0.601(±0.054) | 0.486(±0.051)
Table G.3: Comparison of the min-max truncated estimator $\hat f$ with the ordinary
least squares estimator $\hat f^{(ols)}$ with the heavy-tailed noise (see Section G.1).

Columns: (1) nb of iterations; (2) nb of iter. with $R(\hat f) \neq R(\hat f^{(ols)})$; (3) nb of iter. with $R(\hat f) < R(\hat f^{(ols)})$; (4) $\mathbb{E} R(\hat f^{(ols)}) - R(f^*)$; (5) $\mathbb{E} R(\hat f) - R(f^*)$; (6) $\mathbb{E}[R(\hat f^{(ols)}) \,|\, \hat f \neq \hat f^{(ols)}] - R(f^*)$; (7) $\mathbb{E}[R(\hat f) \,|\, \hat f \neq \hat f^{(ols)}] - R(f^*)$.

Setting            | (1)  | (2) | (3) | (4)           | (5)           | (6)          | (7)
INC(n=200,d=1)     | 1000 | 163 | 145 | 7.72(±3.46)   | 3.92(±0.409)  | 30.52(±20.8) | 7.20(±1.61)
INC(n=200,d=2)     | 1000 | 104 | 98  | 22.69(±23.14) | 19.18(±23.09) | 45.36(±14.1) | 11.63(±2.19)
HCC(n=200,d=2)     | 1000 | 120 | 117 | 18.16(±12.68) | 8.07(±0.718)  | 99.39(±105)  | 15.34(±4.41)
TS(n=200,d=2)      | 1000 | 110 | 105 | 43.89(±63.79) | 39.71(±63.76) | 48.55(±18.4) | 10.59(±2.01)
INC(n=1000,d=2)    | 1000 | 104 | 100 | 3.98(±2.25)   | 1.78(±0.128)  | 23.18(±21.3) | 2.03(±0.56)
INC(n=1000,d=10)   | 1000 | 253 | 242 | 16.36(±5.10)  | 7.90(±0.278)  | 41.25(±19.8) | 7.81(±0.69)
HCC(n=1000,d=10)   | 1000 | 220 | 211 | 13.57(±1.93)  | 7.88(±0.255)  | 33.13(±8.2)  | 7.28(±0.59)
TS(n=1000,d=10)    | 1000 | 214 | 211 | 18.67(±11.62) | 13.79(±11.52) | 30.34(±7.2)  | 7.53(±0.58)
INC(n=2000,d=2)    | 1000 | 113 | 103 | 1.56(±0.41)   | 0.89(±0.059)  | 6.74(±3.4)   | 0.86(±0.18)
HCC(n=2000,d=2)    | 1000 | 105 | 97  | 1.66(±0.43)   | 0.95(±0.062)  | 7.87(±3.8)   | 1.13(±0.23)
TS(n=2000,d=2)     | 1000 | 101 | 95  | 1.59(±0.64)   | 0.88(±0.058)  | 8.03(±6.2)   | 1.04(±0.22)
INC(n=2000,d=10)   | 1000 | 259 | 255 | 8.77(±4.02)   | 4.23(±0.154)  | 21.54(±15.4) | 4.03(±0.39)
HCC(n=2000,d=10)   | 1000 | 250 | 242 | 6.98(±1.17)   | 4.13(±0.127)  | 15.35(±4.5)  | 3.94(±0.25)
TS(n=2000,d=10)    | 1000 | 238 | 233 | 8.49(±3.61)   | 5.95(±3.486)  | 14.82(±3.8)  | 4.17(±0.30)
Table G.4: Comparison of the min-max truncated estimator $\hat f$ with the ordinary
least squares estimator $\hat f^{(ols)}$ with the asymmetric heavy-tailed noise (see Section G.1).

Columns: (1) nb of iterations; (2) nb of iter. with $R(\hat f) \neq R(\hat f^{(ols)})$; (3) nb of iter. with $R(\hat f) < R(\hat f^{(ols)})$; (4) $\mathbb{E} R(\hat f^{(ols)}) - R(f^*)$; (5) $\mathbb{E} R(\hat f) - R(f^*)$; (6) $\mathbb{E}[R(\hat f^{(ols)}) \,|\, \hat f \neq \hat f^{(ols)}] - R(f^*)$; (7) $\mathbb{E}[R(\hat f) \,|\, \hat f \neq \hat f^{(ols)}] - R(f^*)$.

Setting            | (1)  | (2) | (3) | (4)           | (5)          | (6)          | (7)
INC(n=200,d=1)     | 1000 | 87  | 77  | 5.49(±3.07)   | 3.00(±0.330) | 35.44(±34.7) | 6.85(±2.48)
INC(n=200,d=2)     | 1000 | 70  | 66  | 19.25(±23.23) | 17.4(±23.2)  | 37.95(±13.1) | 11.05(±2.87)
HCC(n=200,d=2)     | 1000 | 67  | 66  | 7.19(±0.88)   | 5.81(±0.397) | 31.52(±10.5) | 10.87(±2.64)
TS(n=200,d=2)      | 1000 | 76  | 68  | 39.80(±64.09) | 37.9(±64.1)  | 34.28(±14.8) | 9.21(±2.05)
INC(n=1000,d=2)    | 1000 | 101 | 92  | 2.81(±2.21)   | 1.31(±0.106) | 16.76(±21.8) | 1.88(±0.69)
INC(n=1000,d=10)   | 1000 | 211 | 195 | 10.71(±4.53)  | 5.86(±0.222) | 29.00(±21.3) | 6.03(±0.71)
HCC(n=1000,d=10)   | 1000 | 197 | 185 | 8.67(±1.16)   | 5.81(±0.177) | 20.31(±5.59) | 5.79(±0.43)
TS(n=1000,d=10)    | 1000 | 258 | 233 | 13.62(±11.27) | 11.3(±11.2)  | 14.68(±2.45) | 5.60(±0.36)
INC(n=2000,d=2)    | 1000 | 106 | 92  | 1.04(±0.37)   | 0.64(±0.042) | 4.54(±3.45)  | 0.79(±0.16)
HCC(n=2000,d=2)    | 1000 | 99  | 90  | 0.90(±0.11)   | 0.66(±0.042) | 3.23(±0.93)  | 0.82(±0.16)
TS(n=2000,d=2)     | 1000 | 84  | 81  | 1.11(±0.66)   | 0.60(±0.042) | 6.80(±7.79)  | 0.69(±0.17)
INC(n=2000,d=10)   | 1000 | 238 | 222 | 6.32(±4.18)   | 3.07(±0.147) | 16.84(±17.5) | 3.18(±0.51)
HCC(n=2000,d=10)   | 1000 | 221 | 203 | 4.49(±0.98)   | 2.98(±0.091) | 9.76(±4.39)  | 2.93(±0.22)
TS(n=2000,d=10)    | 1000 | 412 | 350 | 5.93(±3.51)   | 4.59(±3.44)  | 6.07(±1.76)  | 2.84(±0.16)
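Table G.5: Comparison of the min-max truncated estimator $\hat f$ with the ordinary
least squares estimator $\hat f^{(ols)}$ for standard Gaussian noise.

Columns: (1) nb of iter.; (2) nb of iter. with $R(\hat f) \neq R(\hat f^{(ols)})$; (3) nb of iter. with $R(\hat f) < R(\hat f^{(ols)})$; (4) $\mathbb{E} R(\hat f^{(ols)}) - R(f^*)$; (5) $\mathbb{E} R(\hat f) - R(f^*)$; (6) $\mathbb{E}[R(\hat f^{(ols)}) \,|\, \hat f \neq \hat f^{(ols)}] - R(f^*)$; (7) $\mathbb{E}[R(\hat f) \,|\, \hat f \neq \hat f^{(ols)}] - R(f^*)$. A dash marks a quantity that is undefined because the two estimators never differed.

Setting            | (1)  | (2) | (3) | (4)           | (5)           | (6)           | (7)
INC(n=200,d=1)     | 1000 | 20  | 8   | 0.541(±0.048) | 0.541(±0.048) | 0.401(±0.168) | 0.397(±0.167)
INC(n=200,d=2)     | 1000 | 1   | 0   | 1.051(±0.067) | 1.051(±0.067) | 2.566         | 2.757
HCC(n=200,d=2)     | 1000 | 1   | 0   | 1.051(±0.067) | 1.051(±0.067) | 2.566         | 2.757
TS(n=200,d=2)      | 1000 | 0   | 0   | 1.068(±0.067) | 1.068(±0.067) | -             | -
INC(n=1000,d=2)    | 1000 | 0   | 0   | 0.203(±0.013) | 0.203(±0.013) | -             | -
INC(n=1000,d=10)   | 1000 | 0   | 0   | 1.023(±0.029) | 1.023(±0.029) | -             | -
HCC(n=1000,d=10)   | 1000 | 0   | 0   | 1.023(±0.029) | 1.023(±0.029) | -             | -
TS(n=1000,d=10)    | 1000 | 0   | 0   | 0.997(±0.028) | 0.997(±0.028) | -             | -
INC(n=2000,d=2)    | 1000 | 0   | 0   | 0.112(±0.007) | 0.112(±0.007) | -             | -
HCC(n=2000,d=2)    | 1000 | 0   | 0   | 0.112(±0.007) | 0.112(±0.007) | -             | -
TS(n=2000,d=2)     | 1000 | 0   | 0   | 0.098(±0.006) | 0.098(±0.006) | -             | -
INC(n=2000,d=10)   | 1000 | 0   | 0   | 0.517(±0.015) | 0.517(±0.015) | -             | -
HCC(n=2000,d=10)   | 1000 | 0   | 0   | 0.517(±0.015) | 0.517(±0.015) | -             | -
TS(n=2000,d=10)    | 1000 | 0   | 0   | 0.501(±0.015) | 0.501(±0.015) | -             | -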
Figure G.1: Circled points are the points of the training set generated several
times from TS(1000, 10) (with the mixture noise with $p = 0.005$ and $\rho = 0.4$)
that are not taken into account in the min-max truncated estimator (to the extent
that the estimator would not change by removing simultaneously all these points).
The min-max truncated estimator $x \mapsto \hat f(x)$ appears in dash-dot line, while $x \mapsto
\mathbb{E}(Y|X = x)$ is in solid line. In these six simulations, it outperforms the ordinary
least squares estimator.
Figure G.2: Circled points are the points of the training set generated several
times from TS(200, 2) (with the heavy-tailed noise) that are not taken into account
in the min-max truncated estimator (to the extent that the estimator would not
change by removing these points). The min-max truncated estimator $x \mapsto \hat f(x)$
appears in dash-dot line, while $x \mapsto \mathbb{E}(Y|X = x)$ is in solid line. In these six
simulations, it outperforms the ordinary least squares estimator. Note that in the
last figure, it does not consider 64 points among the 200 training points.